OpenAI Just Laid Out AI's Biggest Flaw For All To See
Generative AI has a massive data security issue.
OpenAI is at the forefront of the AI world. Its products, like ChatGPT, are by far the most used AI products worldwide. But this popularity also brings increased scrutiny. Until recently, OpenAI had managed to keep any controversy relatively contained, bar a few copyright disputes. However, recent revelations have not only painted OpenAI in a deeply problematic light but also raised serious questions about the rest of the generative AI industry.
Let’s start with the first revelation, which came from a New York Times exclusive report: in 2023, OpenAI suffered a major data breach and failed to inform anyone outside the company. OpenAI’s reasoning was that the breach didn’t affect its customers or partners, such as Microsoft, and that the hacker appeared to be a private individual rather than an agent of a foreign government.
So, why is this a problem? Well, as we will cover later on, OpenAI holds an utterly vast amount of private, sensitive and remarkably well-organised data. A hacker could therefore have walked away with terabytes of data capable of harming ChatGPT users, and that same meticulous organisation would make it trivial to pull the most damaging information out of the haul. Whether the breach touched partner systems or was the work of a lone individual is beside the point: the hacker now holds potentially very lucrative data and could easily sell it to nefarious groups.
So, why keep the breach quiet? Well, OpenAI’s products depend on gathering as much data as possible to train its AIs, so the company can’t afford a reputation for poor data security, as that reputation could limit its ability to gather data.
But this isn’t the only security issue at OpenAI, which brings me to the second recent revelation: OpenAI’s ChatGPT macOS app has a glaring security flaw. On July 2nd, developer Pedro José Pereira Vieito showed that the application opted out of the Mac’s built-in sandboxing, which is designed to stop apps from exposing private data, and instead stored every user conversation in plain text in an unprotected, unencrypted directory. That meant any other process on the machine, including malware, could easily read a user’s entire ChatGPT history.
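To make the risk concrete, here is a minimal sketch of why plain-text storage outside the sandbox is so dangerous: any process running under the same user account can simply read the files. The directory name and file extension below are illustrative assumptions, not the app’s actual layout.

```python
# Sketch: reading another app's unencrypted chat logs on macOS.
# Any process running as the same user can do this if the files are
# stored in plain text outside the app sandbox.
# NOTE: the folder name and *.json extension are hypothetical examples.
from pathlib import Path

chat_dir = Path.home() / "Library" / "Application Support" / "com.example.chatapp"

if chat_dir.exists():
    for f in sorted(chat_dir.glob("*.json")):
        # No decryption, no permissions prompt: plain text is simply readable.
        print(f"--- {f.name} ---")
        print(f.read_text(errors="replace")[:200])  # preview of each conversation
else:
    print("Example directory not found (this is only an illustration).")
```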
Many people use ChatGPT as a therapist or agony aunt, or lean on the app heavily at work, even when their employer explicitly bans it or their work is highly sensitive. For example, many programmers use ChatGPT to automate coding. Data like this, sitting in an easily accessible form, could be used to blackmail users or cause serious damage to them, their employers or the people connected to them.
Such security flaws are nothing new. Social media sites like Facebook have a long history of data breaches in which hundreds of millions of users’ private data has been leaked. So, why are OpenAI’s recent security issues such a damning indictment of OpenAI and the AI industry as a whole?
Well, OpenAI and other generative AI companies use their data very differently from other tech companies: they train their AIs on it. That means the database has to be extremely high quality, with every data point well-organised and carefully labelled so the AI can correctly “understand” what it contains. If OpenAI and other companies only trained on public data, such leaks would mainly be a problem for OpenAI itself. The company invests huge sums in a vast human workforce to organise, sort and label this database, and if it leaked, OpenAI’s rivals could simply use the already-processed data and save themselves a fortune.
But pretty much all generative AI companies also train their AIs on their users’ data. This makes complete sense: it is by far the easiest data for them to collect, and it is incredibly valuable for honing the AI, because it includes feedback in the form of follow-up prompts correcting mistakes in the AI’s output. Not training on this data would put OpenAI, or any other generative AI company, at a severe disadvantage.
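As a purely hypothetical illustration of what a cleaned, labelled training record built from a user conversation might look like, consider the sketch below. The field names and structure are assumptions for the sake of example, not OpenAI’s actual pipeline or schema.

```python
# Hypothetical illustration only: a labelled training record derived from
# a user conversation. Field names are assumptions, not a real schema.
import json

record = {
    "conversation_id": "example-123",
    "topic_labels": ["python", "debugging"],   # human- or model-applied labels
    "messages": [
        {"role": "user", "content": "Why does my loop never end?"},
        {"role": "assistant", "content": "You never increment i inside the loop."},
        # A follow-up correction from the user is a valuable feedback signal.
        {"role": "user", "content": "I do increment it - the real bug was the break condition."},
    ],
    "quality_score": 0.8,                       # e.g. from human review
}

print(json.dumps(record, indent=2))
```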
So, generative AI companies are strongly incentivised to collect sensitive user data, data that, given how people actually use these services, is potentially far more sensitive than what most large tech companies hold, and to store it in a far more organised, labelled and searchable form than any other tech company does. That makes generative AI companies an inherent gold mine for hackers.
You’d think this would make data security a huge focus for OpenAI and other generative AI companies. But, no.
You see, leading generative AI companies like OpenAI must run incredibly lean to stay ahead of ever-growing competition. Funds go to infrastructure build-out, data harvesting and AI development, with security pushed far down the list of priorities. Again, these companies are incentivised to skimp on proper security measures, because they can grow far faster by focusing elsewhere. This is why ChatGPT’s macOS app shipped with a glaring flaw that even a beginner programmer could have spotted and easily fixed.
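To give a sense of how straightforward the fix is, here is a minimal sketch of encrypting conversations at rest instead of writing them as plain text. It uses the third-party Python cryptography package purely as an illustration; a real macOS app would more likely rely on Apple’s Data Protection or Keychain APIs, and proper key management is out of scope here.

```python
# Minimal sketch of the kind of fix involved: encrypt conversations at rest
# rather than writing them out as plain text. Key management (e.g. storing
# the key in the OS keychain) is deliberately omitted from this example.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, keep this in the OS keychain
cipher = Fernet(key)

conversation = '{"role": "user", "content": "draft my resignation letter"}'
encrypted = cipher.encrypt(conversation.encode())

with open("conversation.bin", "wb") as f:
    f.write(encrypted)           # the on-disk copy is now ciphertext

# Only a process holding the key can recover the original text.
assert cipher.decrypt(encrypted).decode() == conversation
```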
So, while the recent revelations of a significant data breach and glaring data security failings are seriously worrying for OpenAI, they point to problems across the entire generative AI industry. So, how do we solve this? Either generative AI companies find a way to fix these inherent flaws, or governments step in to ensure the data they collect is ethically sourced and adequately protected, or the industry accepts that a sizeable number of potential users will simply refuse to use its products.
Thanks for reading! Content like this doesn’t happen without your support. So, if you want to see more like this, don’t forget to Subscribe and help get the word out by hitting the share button below.
Sources: The Register, TechCrunch, Fortune, Cyberscoop