>when the training set will get over-represented with cheap, fast, crappy code written by LLMs themselves
It's already happening. An MIT study published last week found that Amazon Mechanical Turk workers hired to do RLHF-style training of models were themselves using ChatGPT to pick the best answer. And the web being polluted with AI-generated content, which then gets scraped into Common Crawl and other training datasets, has been an issue for a couple of years now.