A few weeks ago, while perusing the New York Times, I came across the article “How Tech Giants Cut Corners to Harvest Data for A.I.” It was written by a team of five reporters based in San Francisco, Washington, and New York.
The main gist of the article is that large AI companies, including OpenAI, Google, and Meta, are running out of training content for their AI engines. Before we talk about what they did next, let’s think about the implications of this.
Not Enough Content
Running out of content means that these companies have ingested all of the “reputable” English content on the internet. ALL of it. And they need more. The New York Times states that leading systems have ingested as many as three trillion words. And it is not enough.
According to Epoch, a research institute referenced in the article, companies could run out of high-quality content on the internet as soon as 2026. Content is being consumed for AI training faster than it is being produced.
That’s a bit mind-boggling. The entire English-speaking world is not producing enough content to keep up with the demands for AI training.
So what’s an AI mega-giant to do?
READ MORE: How to improve your AI performance
The Wrong Content
I read about two courses of action that companies are either taking or contemplating.
The first is that companies knowingly began to violate copyright and licensing rules. Using a variety of techniques, they scraped YouTube content and pulled in copyrighted news articles (from the New York Times, no less), podcasts, audiobooks, and even quizlets.
But even this wasn’t enough. Google started thinking about all of the content users store in its free consumer apps, such as Google Docs, Google Sheets, and Google Slides.
Google went so far as to change the privacy policy for its free consumer apps. While no one was looking, over the Fourth of July weekend, Google quietly updated the policy so that publicly available user content could be used to train its AI products.
Did you notice this change? I’m sure we were all informed, but I somehow missed it until the New York Times brought it to my attention.
The second technique that the mega-giants have been considering is what they call synthetic data: content that AI systems themselves generate. In other words, AI systems produce text that is then used to train other AI systems.
Say it with me, “What could possibly go wrong?”
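For readers who want to picture what that means, here is a minimal sketch of what generating synthetic training data can look like, assuming the openai Python package; the model name, prompt, and topics are illustrative placeholders, not anything described in the article.

```python
# Minimal sketch: using one model's output as training text for another model.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
# The model name, prompt, and topics are illustrative, not a real pipeline.
from openai import OpenAI

client = OpenAI()

topics = ["password resets", "invoice disputes", "warranty claims"]
synthetic_examples = []

for topic in topics:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": "Write a short, realistic customer support Q&A pair."},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
    )
    # The generated text becomes a "training example" for the next model.
    synthetic_examples.append(response.choices[0].message.content)

print(f"Generated {len(synthetic_examples)} synthetic training examples.")
```

Part of what could go wrong is visible right in the loop: any errors, bias, or blandness in the generated text gets fed straight back into the next round of training.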
Download our free ebook AI In Your Pocket
Is Your Content Ready for AI?
Most of our customers are looking at training their AI solutions on their internal content. They don’t need their internal GenAI to learn from a philosophy student’s college essays in their Google Docs.
Rather, they need their employees and customers to be able to find, access, and understand information pertinent to their business. They need the GenAI to produce summary content that doesn’t take longer to edit than it would have taken to write from scratch.
To train the AI well, the training content must be cleaned up. The more redundancy, inaccuracy, and inconsistency in your content, the less useful the AI’s results will be. More and more, I’m talking to customers about how to tackle this problem so they get the most value out of their technology investment.
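What does cleaning up training content actually involve? As one small example, here is a hedged sketch of flagging near-duplicate passages before they reach an AI pipeline; the difflib comparison and the 0.9 threshold are illustrative stand-ins for whatever tooling your team actually uses.

```python
# Minimal sketch: flag near-duplicate passages before they reach an AI pipeline.
# difflib is in the Python standard library; the 0.9 threshold is an
# illustrative assumption, not a recommended setting.
from difflib import SequenceMatcher
from itertools import combinations

passages = [
    "To reset your password, open Settings and choose Security.",
    "To reset your password open settings, and choose security.",
    "Invoices are emailed on the first business day of each month.",
]

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two passages."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for (i, a), (j, b) in combinations(enumerate(passages), 2):
    score = similarity(a, b)
    if score > 0.9:
        print(f"Passages {i} and {j} look redundant (similarity {score:.2f})")
```

Redundant passages like the first two force the AI to guess which version is correct; consolidating them into a single source of truth is part of the cleanup.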
All the things we do to make content better for humans – componentize it, structure it, clean up the grammar, tag it with metadata – make content better for machines as well. And my best advice: good writing is the secret to optimizing your AI results.
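To make the componentize-structure-tag idea concrete, here is a minimal sketch of a metadata-tagged content chunk; the field names and values are illustrative assumptions, not a formal content model.

```python
# Minimal sketch: a componentized content chunk with metadata a machine can use.
# The field names and values are illustrative, not a standard.
chunk = {
    "id": "warranty-returns-001",   # stable identifier so the chunk can be reused
    "type": "procedure",            # component type: concept, task, reference, ...
    "product": "Model X Widget",
    "audience": "customer",
    "last_reviewed": "2024-08-01",
    "body": (
        "To start a warranty return, sign in to your account, "
        "open Orders, and select Return item."
    ),
}

# Metadata like this lets a GenAI retrieval step filter by product and audience
# instead of guessing from raw prose.
print(chunk["id"], "->", chunk["type"])
```

Well-written, well-tagged components like this are exactly what the machines need in order to find and summarize your content reliably.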
Want more helpful insights into AI and content strategy? Subscribe to the Content Rules newsletter, where you’ll get thoughtful advice from our experts.