Last month, OpenAI showed the world its Sora video generation AI to a mixed response. Some declared that Sora would render film studios obsolete and revolutionise the entertainment industry, while others, including myself, pointed out that, like all generative AI, it still has some major uncanny valley issues. But the realism of the videos isn’t Sora’s biggest problem. You see, OpenAI CTO Mira Murati did a video interview with the Wall Street Journal’s Joanna Stern to promote Sora, and in doing so, she inadvertently highlighted Sora’s Achilles heel to the world.
In the interview (which you can watch here), Stern asked Murati where OpenAI got the data to train such a powerful AI. Murati gave about the worst answer possible: “We used publicly available data and licensed data.” This is as vague an answer as you could give, and Stern, being a good journalist, pushed for clarity by asking whether publicly available data meant YouTube videos. Murati’s response was somehow worse: “I’m actually not sure about that.” As Chief Technology Officer, overseeing decisions about exactly such details is literally her job; she knows, but she isn’t letting it slip. When Stern pressed further, asking whether OpenAI used video data from Facebook and Instagram for Sora, Murati said, “I’m not sure. I’m not confident about it.” Under yet more pressure, she changed her tune to, “I’m just not going to go into detail about the data that was used.”
Murati might as well have held a giant sign that read, “We are guilty of scraping video data from YouTube, Facebook and Instagram.”
But why does that matter?
AI companies have to skirt copyright law to gather the vast amounts of data they require to train their AIs. Take my articles: if you copied and pasted them somewhere else, that would be a breach of copyright. But what if an AI company copied them and used them to train a chatbot? Would that be a violation of copyright? Personally, I think it should be. Just because articles, videos or other media are publicly available doesn’t mean they can be used for AI training without permission and proper compensation.
Why? Well, because AI can’t create something new or have an original thought. It uses data to replicate something based on user demands. In other words, if an AI is trained on a creator’s work, it can replicate their work. But even calling this “replication” is generous; it suggests inspiration spawning a new, derived work, which is a legitimate form of media creation. Instead, the AI mashes together bits of the creator’s original work using statistics to make a Frankenstein copycat. So, there is no “fair use” at play here (read about fair use here). AI just straight-up copies what’s in its training data.
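To make that “statistical copycat” idea concrete, here is a toy sketch in Python; my own illustration, not how Sora actually works. It’s a first-order Markov chain, about the crudest generative model there is, but it shows the underlying dynamic: every word the model emits is lifted straight from its training data.

```python
import random
from collections import defaultdict

# Toy illustration: a first-order Markov chain "trained" on a tiny corpus.
# It can recombine what it has seen, but it can never invent material
# that was not in the training text.

corpus = "the cat sat on the mat the cat chased the dog".split()

# Build a transition table: each word maps to the words that followed it.
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start: str, length: int = 8) -> str:
    """Generate text by statistically replaying the training data."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = transitions.get(word)
        if not followers:  # dead end: nothing in the training data to copy
            break
        word = random.choice(followers)  # weighted by observed frequency
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the cat sat on the mat the dog"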
But this also means that if an AI isn’t trained on a creator’s work, it can’t replicate it.
Let me put it another way. If I downloaded a MrBeast video and re-uploaded it to my YouTube account with a blue hue applied, it would rightly be taken down for copyright infringement. I can’t claim fair use for the upload, as I haven’t actually added anything of substance to the original. If you use Sora to create a MrBeast episode, it will use MrBeast video data that OpenAI scraped from YouTube and cobble it together into something that resembles a MrBeast video. I, and many others, would argue that this doesn’t qualify as adding anything of substance to the original data, so fair use isn’t applicable, and therefore OpenAI needs to properly license the data. Particularly as there is a huge question mark over whether non-sentient software can have any creative input at all.
Artists of all kinds are also pushing hard for this licensing, as AI can pump out work far faster than they can, flooding the market and devaluing their work while making the AI company vast piles of money. Surely, if an AI company is profiting heavily from their work, the creators deserve remuneration?
But here is the crux: AI models require a truly vast amount of data to function. More complex AIs like Sora require even more data than the already data-hungry chatbots we have become accustomed to recently. If AI companies correctly paid for all the data they used, AIs would be far too expensive to develop. As it currently stands, AI can only function by violating the basic rules of capitalism and committing digital robbery. Now, some places in the world have laws that touch on this, such as the GDPR in the EU. It is far from watertight or comprehensive, but such legislation means AIs like Sora might not be available in the EU.
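To give a feel for the scale problem, here is a deliberately crude back-of-envelope calculation. Every number in it is a hypothetical assumption for illustration only; OpenAI has not disclosed Sora’s training-set size or any licensing rates.

```python
# Back-of-envelope sketch of why full licensing may be cost-prohibitive.
# Both figures below are hypothetical assumptions, not disclosed numbers.

videos_in_training_set = 100_000_000  # assumed: hundreds of millions of clips
licence_fee_per_video = 1.00          # assumed: a token $1 per video

total_licensing_cost = videos_in_training_set * licence_fee_per_video
print(f"${total_licensing_cost:,.0f}")  # $100,000,000 for data alone
```

Even at a token $1 per clip, the data bill alone reaches nine figures before a single GPU is purchased, and real licensing negotiations with rights holders would almost certainly cost far more.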
There are hundreds of thousands of creators on Facebook, Instagram, and YouTube who earn income from their videos. Moreover, they also make Meta and Google a truly astonishing amount of money each year. If an AI like Sora disrupts this creator economy without correctly compensating those whose data it was trained on, there could be a huge backlash. Creators will flee these sites and start new ones that block AI from scraping their data, and in response, Meta and Google might start taking action against companies like OpenAI.
In other words, even if Sora could reproduce video perfectly without us noticing (which it can’t, by the way), it would still be a limited tool because, without proper compensation for those whose creativity and hard work trained the AI, its output would be too controversial to use. So, while Murati’s interview might seem just a little cringeworthy on the surface, the repercussions of what she is inadvertently alluding to are disastrous for everyone involved.
Thanks for reading! Content like this doesn’t happen without your support. So, if you want to see more like this, don’t forget to Subscribe and follow me on BlueSky or X and help get the word out by hitting the share button below.
Sources: Futurism, The Decoder, PetaPixel, Mashable, Computer.org, The Guardian, Securiti, IEEE, WSJ