We have grown accustomed to the insanely hyperbolic proclamations of AI CEOs. From claiming their technology could wipe out humanity to insisting they are close to producing a superintelligent AI, every time they open their mouths, unfounded, manipulative PR drivel seems to flow out. But amongst all this bullshit, there is the occasional slip of the mask, and the reality of their delicate house of cards is exposed. This happened recently with Microsoft’s CEO, Satya Nadella, who has been pivotal in the company’s colossal investment in OpenAI. Nadella publicly stated that copyright laws needed to be changed in order to allow AI companies to train their models on copyrighted data without fear of repercussions. If this isn’t a desperate plea for help, I don’t know what is.
But to explain why, we need to recap a few things first.
Let’s start with why AI needs copyrighted material.
All an AI does is identify trends in a dataset through a process known as “training”, and then recognise those trends in other datasets or replicate them when prompted. Some AIs, like cancer-spotting medical AIs, only require a small dataset to become extremely accurate, as they have just one hyper-specific job to do, and their output is far simpler (positive or negative). As such, these AIs don’t require copyrighted material.
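If you want a concrete picture of what “training” means for one of these narrow AIs, here is a minimal, purely illustrative sketch in Python. It uses scikit-learn’s bundled breast-cancer dataset as a stand-in; no real medical AI is this simple, but the shape of the task (learn trends from labelled examples, then output positive or negative on unseen cases) is the same:

```python
# A toy illustration of "training": fit a model to trends in a small
# labelled dataset, then check it recognises those trends in new data.
# Uses scikit-learn's bundled breast-cancer dataset; this is NOT a real
# diagnostic pipeline, just a sketch of the narrow-classifier idea.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # features + positive/negative labels

# Hold back some examples to test whether the learned trends generalise.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training": fit a simple trend-finding model to the labelled examples.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# The output is far simpler than generative AI: just positive or negative.
print(f"accuracy on unseen cases: {model.score(X_test, y_test):.2f}")
```

The whole dataset here is only 569 labelled examples, yet the model reaches high accuracy, because the job is so narrow. Generative AI has no such luxury.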
But it’s a whole other story for generative AIs, such as those produced by OpenAI. ChatGPT is effectively an AI hooked up to a predictive text machine. It takes trends it has found in the text it was trained on, uses them to figure out which word should come next to best meet your prompt, and builds its output word by word. This is a far broader application with a far more flexible output, and as such, the AI needs vastly more training data to do this even remotely accurately. In fact, OpenAI’s old and not very accurate GPT-3 was trained on some 300 billion tokens of text, equivalent to roughly a third of the British Library, the largest library on the planet. What’s more, the training datasets for their newest models, like GPT-4o and Strawberry, are likely significantly larger than GPT-3’s.
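To see that “predictive text machine” idea in miniature, here is a toy sketch in Python. It is nothing like the neural networks OpenAI actually uses (those operate on tokens and billions of learned parameters), but it shows the basic mechanic: find trends in training text, then build output one word at a time by predicting what comes next:

```python
# A toy "predictive text machine": count which word tends to follow
# which in some training text, then generate output word by word.
# A crude sketch of the principle, not how GPT models actually work.
import random
from collections import defaultdict

training_text = (
    "the cat sat on the mat and the cat saw the dog "
    "and the dog sat on the mat"
)

# "Training": record, for each word, the words observed to follow it.
follows = defaultdict(list)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    follows[current].append(nxt)

# Generation: repeatedly pick a plausible next word, building the
# output one word at a time, exactly as described above.
word = "the"
output = [word]
for _ in range(8):
    word = random.choice(follows[word])  # sample from observed trends
    output.append(word)
print(" ".join(output))
```

Crucially, this toy model can only ever output words it saw during training, arranged in orderings resembling its training text. Scale that mechanic up by billions, and you can see why the quality and quantity of training data matters so much.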
When you consider that copyright is automatically granted to a creator in the Western world, this data-hungry side of generative AI becomes a huge problem. Where do these companies get all this data from? It has to be high-quality, otherwise their AI won’t be accurate. They can’t create it in-house; that would cost trillions of dollars, and they are already burning through cash far faster than they can make it. As such, they have to steal it: scraping articles, blogs, forum posts, tweets and video transcripts from the internet, as well as copying reams of traditionally published media like magazines and books.
Surely, this violates copyright law? To build advanced AI off someone else’s copyrighted work seems wrong.
Well, yes and no.
According to these AI companies, they can use this data under “fair use”. This is an exemption in copyright law that enables someone to use copyrighted material without permission or payment for criticism, news reporting, teaching, research or transformation. This is how many YouTube videos can use small clips of movies they are talking about; they aren’t simply replicating the material. Instead, they are being critical of it or reporting on it, and therefore they are allowed to use it without permission or payment.
AI companies claim that their models “transform” the copyrighted material and so fall under fair use and don’t need permission or payment to use this data. Under copyright law, transformative fair use is granted if your use of copyrighted material gives your secondary work a new meaning or message.
But here is the thing: that simply isn’t the case. Models like these can only replicate what they have been trained on, and so they can only replicate the meaning or message of their training data. In fact, I have seen this myself when playing with OpenAI’s models, as I was able to get them to almost exactly replicate sentences and images from material we know they were trained on. But there is an even more hilarious example. Do you remember when Google’s AI kept telling people to put glue in their pizza recipes or to eat a certain number of rocks per day? Well, it was repeating, almost verbatim, jokes that people had posted on Reddit, which Google had scraped to train its AI, only now it presented them as serious suggestions.
There is zero genuine transformation going on here, only shoddy replication. Citing fair use as cover for this digital daylight robbery is a pathetically weak legal defence, and AI companies know it.
They have been able to shrug off smaller copyright holders’ claims for a while now simply by using their fiscal might and armies of lawyers. But the big boys are starting to bite back. Numerous large publications have filed legal proceedings against AI companies for using their copyrighted data, and many seem set to win. WB, Universal and Sony have also taken numerous AI companies to court. They want to stop their material from being used in this way, and are seeking $150,000 in compensation for each of their works used to train AI. With training datasets containing thousands upon thousands of such works, that is enough to bankrupt these AI startups.
The tide is turning on generative AI companies. Their old paper-thin defence is falling apart. As people slowly learn how this AI works and how it is being used, they realise that it flagrantly violates the intention of copyright law. You see, copyright law is meant to ensure copyright holders won’t be drowned out by copycats and can benefit from their own labour. Using AI like this does precisely the opposite. These AI companies take the value of people’s labour to build a model that can only badly replicate their work, but can produce vastly more output than they ever could, drowning them out despite the sub-par quality.
This puts people like Satya Nadella in a tricky situation.
You see, he has poured literally tens of billions of dollars into OpenAI, which is only the largest AI company because it has stolen the most copyrighted content so far, and so has the largest and most accurate models (though they are far from genuinely accurate). He has also bet the future of Microsoft on this gigantic gamble paying off.
Yet this looming legal threat could utterly destroy OpenAI.
The vast majority, as in over 99%, of the data used to train their models is copyrighted. Indeed, OpenAI has said it would be impossible to build their models without copyrighted material. So they can’t just stop using this data, as their models would simply stop working altogether.
They can’t pay for the copyrighted material, either. That would potentially cost trillions of dollars (particularly if WB, Universal and Sony set a precedent of $150,000 per work). OpenAI is already burning through billions of dollars a year and struggling to stay afloat as it is.
The only way forward is to pressure US and EU lawmakers to modify copyright law in their favour and explicitly allow AI training on copyrighted material without permission. Japan has already carved out just such an exemption, and Nadella is pushing the US to adopt the same stance.
However, Japan did this under the notion that it was necessary to compete on a global scale. You see, a year ago, economists, market analysts and politicians believed that advanced generative AI would spark a new cold war, and that whichever economy had the best generative AIs would come to dominate the global market. In their eyes, violating our rights to our own work was a small price to pay for keeping our economies on top. In fact, a year ago, many politicians and economists on both sides of the Atlantic were pushing for the same copyright law changes Nadella now wants.
However, these politicians and economists have largely changed their minds. They have watched AI fail to find a single use case in which it is significantly better than human output by any measure, and AI development has seriously stagnated even as investment in the technology has spiked. As such, many lawmakers no longer see AI as the next economic battleground, and so no longer see any justification for swaying copyright law in AI companies’ favour. In fact, many have turned 180 degrees and now want to tighten copyright law to explicitly ban AI training without permission, in order to protect their economies’ information markets.
So, Nadella calling for this change now is not only too little, too late, but also a desperate plea for help to stop his $14 billion gamble on OpenAI’s house of stolen cards from collapsing.
But even if he gets his way (after all, US law is very rarely written or rewritten to restrict big business), it won’t be the end of his woes.
There is a second angle from which copyright holders can demand that AI companies stop using their data, or pay compensation. This was pioneered by David Millette, the owner of the YouTube channel Legal Drive, who is suing OpenAI and Nvidia for “unjust enrichment”. Unjust enrichment is defined as a “benefit by chance, mistake or another’s misfortune for which the one enriched has not paid or worked and morally and ethically should not keep. If the money or property received rightly should have been delivered or belonged to another, then the party enriched must make restitution to the rightful owner.” For a court to find unjust enrichment, three things must be true. Firstly, the defendant was enriched. Secondly, this happened at the claimant’s expense. Finally, it would be against equity and good conscience to permit the defendant to retain what is sought to be recovered. If all three are found to be true, the court can order the defendant to compensate the claimant.
This perfectly describes how AI companies use copyrighted data to build hundred-billion-dollar-plus businesses without compensating the copyright holders. As such, Millette seems to have a damn good chance of holding OpenAI to account, even if copyright law changes in Nadella’s favour.
What’s more, even though Japan’s copyright law now allows AI training without permission, its legal system still recognises “unjust enrichment” as grounds for compensation. In fact, unjust enrichment is recognised by legal systems almost everywhere in the world.
It’s no wonder Nadella and the generative AI world are collectively shitting their pants. They are losing money hand over fist. Their models aren’t getting any better, as they are butting up against the hard limitations of the technology. They are struggling to find enough paying customers. And now it seems the foundation of their technology, the data used to build these models, will either be taken away or will force them to fork over so much cash that even bankruptcy won’t be enough to settle the debts.
I have said it many times in this article already, but it bears hammering home: this is a house of cards, and it will fall eventually.
Thanks for reading! Content like this doesn’t happen without your support. So, if you want to see more like this, don’t forget to Subscribe and help get the word out by hitting the share button below.