Are AI companies using copyrighted data? • The Midas Project

Training today's large AI models depends on three fundamental ingredients: advanced algorithms, advanced computer chips, and a lot of data. The last of these has become a sticking point in recent years for AI companies including OpenAI, Anthropic, Meta, and Google.

These companies have essentially hoovered up the entire internet in their fight to outcompete each other by building ever larger and more capable AI products. This data includes news articles, books, videos, social media posts, and more.

But this raises an important question: Have these companies ensured they are accessing data fairly and responsibly? So far, the available evidence points to a firm and resounding “no.”

Has copyrighted data been used in the creation of advanced AI models?

Last week, the New York Times reported that big tech companies have been cutting corners in a major way while training the current and upcoming generations of advanced AI.

For example, OpenAI reportedly transcribed a vast number of YouTube videos to use in training its current frontier model, GPT-4. As you may have already guessed, this is not only a violation of YouTube’s terms of service, which prohibit both the automated “scraping” of YouTube data and its use to build external products, but it may also be a violation of copyright and privacy laws. Any video uploaded to YouTube may have been used, without adequate compensation, to train OpenAI’s most profitable product to date.

So why didn’t Google step in to address OpenAI’s use of YouTube data? The New York Times article reports that it may have been because Google was planning to do the same thing in building its own AI models.

This isn’t the first time that big tech companies have shown blatant disregard for the ethical acquisition and use of copyrighted data. There is reason to believe that they also trained earlier models on copyrighted material, including datasets full of copyrighted books.

Is the use of copyrighted data in the production of advanced AI systems fair?

It’s hard to understand exactly what is going on inside an AI model when it is trained on real-world data. At the most basic level, the model practices predicting the content of that data so that it gets better at predicting similar material in the future.
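To make the idea of "practicing prediction" concrete, here is a deliberately tiny sketch in Python. It is not how GPT-4 or any production model works; it is a toy that only counts which word tends to follow which in a made-up sentence. The function names and the training sentence are hypothetical, chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follower_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training data, invented for this illustration.
training_text = "the cat sat on the mat and the cat slept"
model = train_bigram_model(training_text)

# "the" is followed by "cat" twice and "mat" once, so "cat" is predicted.
prediction = predict_next(model, "the")
```

Even in this toy, everything the model "knows" comes directly from the text it was given, which is why the question of where training data comes from matters so much.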

Many AI companies will argue that this is fair use and, in some ways, analogous to what humans do when they consume content. It’s only by listening to lots of jazz, for example, that humans develop an intuition for what jazz sounds like and become competent at making their own. Indeed, advanced AI models can make original work in the style of the real-world data they have consumed.

However, these models are also capable of outright plagiarizing real-world data. The New York Times is currently suing OpenAI over its uncompensated use of New York Times articles in training models like GPT-4. Key to the newspaper's argument is that, when asked, OpenAI's models will reproduce specific New York Times articles nearly verbatim. This threatens the paper's business model and the future of sustainable journalism. After all, why would a consumer subscribe to a newspaper when they could read the same articles for free by asking ChatGPT?

Do these AI models threaten the livelihood of artists, musicians, and creators?

Even if AI models can one day be prevented from re-creating copyrighted materials verbatim (the issue at the heart of the New York Times lawsuit), they may still threaten the livelihood of the creators whose work they were trained on.

Earlier this year, OpenAI unveiled Sora, a video generation model that can produce lifelike footage in response to simple text prompts. It is virtually certain that this model was trained using vast amounts of footage originally created by human filmmakers. Yet it may soon be cheaper and more accessible for companies to use Sora for video generation than to hire human artists, which threatens those artists' income.

The same can be said for writers, as in the case of ChatGPT, and musicians, as demonstrated by the recent release of Suno and Udio (music generation services that create uncannily realistic songs in various genres). Many will argue that creators have not been fairly compensated for the value they’ve offered these AI companies — the same companies that now threaten their ability to earn an income through their craft.

What can be done to ensure fair AI development?

Perhaps the most significant problem underlying big tech’s race to consume and transform all the human-generated data on the internet is that the companies are undertaking it unilaterally. There was no democratic deliberation. A handful of companies, and the tech CEOs who lead them, made the independent choice to use vast amounts of data, often without compensating the creators of that data, and they stand to gain a considerable amount personally from doing so.

We may eventually want to live in a future with superhuman AI systems capable of creating beautiful writing, artwork, and music from a simple text prompt. In fact, such systems may empower human creators by giving them a wider and far more powerful palette of tools for creating art and writing. But if we decide to follow that path, it needs to be the result of democratic deliberation, not the profit-motivated scrambling of a handful of tech companies.

The Midas Project is fighting to hold tech companies accountable and demand the fair and responsible use of data in training their products. We’d love to have you on board. Click the button below to find ways to get involved in this fight.