Is AI Really Running Out of Content?

A few weeks ago, while perusing the New York Times, I came across the article “How Tech Giants Cut Corners to Harvest Data for A.I.” It was written by a team of five reporters based in San Francisco, Washington, and New York.

The gist of the article is that large AI companies, including OpenAI, Google, and Meta, are running out of training content for their AI engines. Before we talk about what they did next, let’s think about the implications of this.

Not Enough Content

Running out of content means that these companies have ingested all of the “reputable” English content on the internet. ALL of it. And they need more. The New York Times states that leading systems have ingested as many as three trillion words. And it is not enough.

According to Epoch, a research institute referenced in the article, companies could run out of content on the internet as early as 2026. Content is being used faster than it is being produced.

That’s a bit mind-boggling. The entire English-speaking world is not producing enough content to keep up with the demands of AI training.

So what’s an AI mega-giant to do?

The Wrong Content

I read about two courses of action that companies are either taking or contemplating.

The first is that companies knowingly began to violate copyright and licensing rules. Using a variety of techniques, they scraped YouTube content and used copyrighted news articles (even from the New York Times), podcasts, audiobooks, and even Quizlet material as training content.

But even this wasn’t enough. Google started thinking about all of the content users store in its free consumer apps, such as Google Docs, Google Sheets, and Google Slides.

Google went so far as to change its privacy policy for its free consumer apps. While no one was looking, over the Fourth of July weekend, Google revised the policy.

Did you notice this change? I’m sure we were all informed, but I somehow missed it until the New York Times brought it to my attention.

The second technique that the mega-giants have been considering is what they call synthetic data. Synthetic data is content that AI systems generate. In other words, the AI systems will generate text to train AI systems.

Say it with me, “What could possibly go wrong?”

Is Your Content Ready for AI?

Most of our customers are looking at training their AI solutions on their internal content. They don’t need their internal GenAI to learn from a philosophy student’s college essays stored in Google Docs.

Rather, they need their employees and customers to be able to find, access, and understand information pertinent to their business. They need the GenAI to produce summary content that does not require more time to edit than it would have required to write from scratch.

To train the AI well, the training content must be cleaned up. The more redundancy, inaccuracy, and inconsistency there is in your content, the less useful the AI’s results will be. More and more, I’m talking to customers about how to tackle this problem so they get the most value out of their technology investment.
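
For readers who want a feel for what a first pass at finding redundancy might look like, here is a minimal sketch. It is not a tool described in this post; it assumes your content has already been exported as short passages keyed by an ID, and it uses Python’s standard-library difflib to flag passages that say nearly the same thing. The sample passages, the function names, and the 0.85 threshold are all made up for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_redundant_pairs(passages: dict[str, str], threshold: float = 0.85):
    """Return (id_a, id_b, similarity) for passages that look like near-duplicates.

    The 0.85 default is an arbitrary starting point, not a recommendation.
    """
    normalized = {pid: normalize(body) for pid, body in passages.items()}
    pairs = []
    # Compare every passage with every other passage (fine for small sets).
    for (id_a, text_a), (id_b, text_b) in combinations(normalized.items(), 2):
        score = SequenceMatcher(None, text_a, text_b).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 2)))
    return pairs


# Hypothetical sample content, invented for this sketch.
docs = {
    "install-guide": "To reset your password, click Settings, then choose Reset Password.",
    "faq": "To reset your password, click Settings and then choose Reset Password!",
    "release-notes": "Version 2.1 adds support for single sign-on.",
}

for id_a, id_b, score in find_redundant_pairs(docs):
    print(f"{id_a} and {id_b} overlap (similarity {score})")
```

A pairwise comparison like this only scales to a few thousand passages, and it catches wording overlap rather than inaccuracy or inconsistency, so treat it as a way to start the conversation about cleanup, not as the cleanup itself.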

All the things we do to make content better for humans – componentize it, structure it, clean up the grammar, tag it with metadata – make content better for machines as well. And my best advice: good writing is the secret to optimizing your AI results.

Author

Val Swisher

Val Swisher is the Founder and CEO of Content Rules, Inc. Val enjoys helping companies solve complex content problems. She is a well-known expert in content strategy, structured authoring, global content, content development, and terminology management. Val believes content should be easy to read, cost-effective to create and translate, and efficient to manage. Her customers include industry giants such as Google, Cisco, Visa, Facebook, Roche, and IBM. Her fourth book, “The Personalization Paradox: Why Companies Fail (and How to Succeed) at Creating Personalized Experiences at Scale,” was published in 2021 by XML Press.

Val is on the Advisory Board for the Technical Communications Program at the University of North Texas. When not working with customers or students, Val can be found sitting behind her sewing machine working on her latest quilt. She also makes a mean hummus.
