While AI is nowhere near as capable as many claim, it is nonetheless being integrated into our everyday lives, particularly generative AI. ChatGPT, for example, now seems to be integrated into every operating system known to man. Similar generative tools are becoming widespread across numerous industries, even when they add little of value. What’s more, AI-generated content is taking over more and more of the corners of the internet we call home. Whether we like it or not, every inch of our lives is being touched by this technology. So the fact that new research suggests generative AI models could collapse into meaningless babble in just a few years is deeply worrying. Let me explain.
To understand this issue, we first have to understand how AI works, so let’s quickly recap. AI works by training an artificial neural network on a vast amount of carefully prepared data. The network finds trends in that data, enabling it to recognise similar patterns in new data, predict how a dataset might evolve, or extrapolate those trends to create something new but derivative of the original dataset.
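If you want to see that idea in miniature, here is a toy sketch in Python using scikit-learn. It is obviously nothing like the scale or architecture of a real chatbot, but it shows the core loop: fit a small neural network to prepared, labelled examples, then ask it to recognise the same pattern in data it has never seen.

```python
# Toy illustration only: a tiny neural network learning a pattern from data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# "Carefully prepared data": 1,000 labelled points drawn from two interleaving patterns.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small artificial neural network learns the trend separating the two classes.
model = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# It can now recognise the same pattern in data it was never trained on.
print(f"Accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```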
Generative AI falls into the extrapolation camp. Take GPT-4, the model behind ChatGPT, for example. It was reportedly “trained” on 570 GB of carefully selected raw text files; that might not sound like a lot, but it is equivalent to roughly 300 billion words! Or, to put it another way, if I wrote flat out 24/7 at around 40 words a minute, it would take me 14,270 years to write that many words. So, when you query GPT-4, it uses trends in this vast amount of data to statistically generate an output that matches your request.
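To make “statistically generate an output” a little more concrete, here is a deliberately tiny sketch of the same principle. It is a simple word-pair model over a made-up sentence, nothing like the neural network GPT-4 actually uses, but the idea is the same: learn which word tends to follow which, then produce new text by sampling those learned continuations.

```python
# Minimal sketch of statistical text generation: learn word-to-word trends
# from a tiny made-up corpus, then sample new text from those trends.
import random
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug the cat saw the dog".split()

# Learn the trend: which words follow which in the training data?
following = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    following[current].append(nxt)

# Generate: start from a prompt word and keep sampling a plausible next word.
random.seed(0)
word, output = "the", ["the"]
for _ in range(8):
    word = random.choice(following[word])   # pick a continuation seen in training
    output.append(word)

print(" ".join(output))   # prints a short, plausible-looking string stitched from the patterns above
```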
Okay, how does this lead to generative AI collapsing in just a few years?
Well, where do you think OpenAI got all that data to train ChatGPT4?
In short, they stole it off the web. Creating such an AI would be financially unfeasible if the data were obtained ethically, with licences and consent. So, instead, they simply scrape data off the internet. GPT-4’s training data, for example, came from books, Wikipedia, Reddit, Twitter and other public places online. This is the core of the problem.
You see, these public spaces are being flooded with AI-generated content, which in some circumstances is indistinguishable from human-made content. Twitter (X) is particularly bad, with researchers finding mainstream topic searches that returned 100% AI-generated content. What’s more, much of this content was liked and commented on by other generative AI bots. But it isn’t just Twitter. In 2022, Mark Zuckerberg stated that 15% of feed content was AI-generated and that the company expected that figure to more than double by the end of 2023, with data suggesting it was surpassed by a wide margin. In fact, a recent analysis predicts that 90% of the content posted to the internet in 2026 will be AI-generated.
Models like GPT-4 must constantly be retrained on new data to stay relevant and useful. This means generative AI is already starting to eat itself alive, being trained on its own output or the output of other AIs.
Why is this a problem? Well, it means these models will start recognising the patterns of AI-generated content, not human-made content. This can lead to a rabbit hole of development in which the AI optimises itself in a counterproductive way. The patterns it sees in AI content can drift away from, or directly contradict, those found in human content, and each new generation amplifies the previous generation’s quirks while losing the rarer, more diverse patterns. The result is increasingly narrow, erratic and unreliable outputs, which could render the AI useless! This is known as model collapse.
This is where a recent study published in Nature comes in. The researchers found that it takes only a few cycles of training generative AI models on their own output to render them completely useless, producing nothing but nonsense. In fact, one AI they tested needed only nine cycles of retraining on its own output before its responses degenerated into a repetitive list of jackrabbits.
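You can recreate the flavour of this result in a few lines of Python. This is not the study’s code, and it uses a far simpler statistical model than a language model, but the feedback loop is the same: fit a model to some data, generate synthetic data from it, train the next generation on that output, and repeat.

```python
# Toy recreation of model collapse: each generation is fitted only to the
# previous generation's synthetic output, so rare values get lost and the
# variety in the data withers away.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=20)   # generation 0: "human" data

for generation in range(51):
    mean, std = data.mean(), data.std()          # "train" a simple model on the current data
    if generation % 10 == 0:
        print(f"generation {generation:2d}: diversity (std) = {std:.3f}")
    data = rng.normal(mean, std, size=20)        # the next generation trains on this model's own output
```

Run it and the diversity figure typically shrinks dramatically as the generations tick by; push it far enough and the “model” can only reproduce near-identical samples, a crude analogue of a chatbot stuck repeating jackrabbits.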
As such, by 2026, these generative AIs will likely be trained on data that is primarily of their own creation, and it will take only a few rounds of training on that data before they fall apart.
Worse, we currently don’t have a solution to this issue. Some have suggested watermarking AI-generated content in a way machines can recognise, so it could be filtered out of training data; without such a marker, we simply can’t ensure this content doesn’t end up being trained on. However, much of the generative AI industry is built on the ability to pass this content off as human-made. Twitter bots, for example, overwhelmingly want to convince you they are human. Making AI content that easy to identify (even if we can’t spot it ourselves, a simple tool could find the watermark) could ruin the barely profitable industry. So the AI industry is pushing hard against such solutions, all while it sprints towards model collapse.
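To see why a watermark only helps if nobody strips it out or refuses to add it, here is a deliberately crude, purely illustrative sketch. Real proposals hide statistical watermarks in the word choices themselves rather than using an invisible character like this, but the weakness is the same.

```python
# Purely illustrative watermarking: hide an invisible marker in generated text.
# Humans can't see it, but a one-line check can, and anyone who wants the text
# to pass as human-made can just as easily leave it out or remove it.
ZW_MARK = "\u200b\u200c\u200b"   # zero-width characters, invisible in most displays

def watermark(text: str) -> str:
    """Tag text as machine-generated by appending an invisible marker."""
    return text + ZW_MARK

def is_watermarked(text: str) -> bool:
    """The 'simple tool': check whether the invisible marker is present."""
    return ZW_MARK in text

post = watermark("Totally written by a real human, promise.")
print(post)                  # looks like ordinary text to a reader
print(is_watermarked(post))  # True; a scraper could skip this when building training data
```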
This is the paradox of AI: the more we use it, the worse it gets. It’s also why we shouldn’t build our industries or digital social systems around this technology, as it could crumble away very soon, leaving our economy and digital lives like a hollowed-out, rotten tree waiting for the next storm to topple it.
Thanks for reading! Content like this doesn’t happen without your support. So, if you want to see more like this, don’t forget to Subscribe and help get the word out by hitting the share button below.
Sources: The Independent, The Independent, Nature, Morning Brew, Yahoo, The Living Library, Planet Earth & Beyond, ABC