Moving to Generative AI: Train Your LLM with Structured Content | Content Rules

Artificial intelligence (AI) used to be something we thought about as a future need or threat. Just a couple of years ago, the impact of AI on our daily lives was indirect. Companies employed AI systems that ran behind a user interface. Even though we have interacted with AI for several years, we did not necessarily know when AI was in the background.

But all that changed when OpenAI first launched ChatGPT to the public back in 2022. Since then, AI has gone from “the big, exciting, and scary future” to “AI is in my pocket.”

Generative AI and Large Language Models Defined

Even though Generative AI (GenAI) and Large Language Models (LLMs) have been making big waves recently, companies are still trying to understand their implications. If you’re not familiar, here’s a quick primer:

GenAI is a class of AI applications that learn patterns from existing data and use that learning to create new, original content. An LLM is a type of GenAI that uses natural language understanding (NLU) to discern and generate human-like text. LLMs use an enormous amount of training content to instruct the NLU. While LLMs focus on text, GenAI encompasses a much wider range of output, including images, audio, video, and more.

Is AI the One-Stop Solution?

From the outside, LLMs and GenAI look like the one-stop solution to all the problems that have plagued content creators and content seekers. If we can have LLMs create the content we need when we need it, have we reached content nirvana? I talk to people all day, every day, about GenAI and LLMs. They don’t know what to do or which way to go. They are confused and conflicted about the need for structured content. Their management is asking questions and those questions are going unanswered:

Should we move to GenAI and LLMs now and skip the structured content step completely? Won’t that save us time and money? What could possibly go wrong?

What Could Possibly Go Wrong?

On July 6, 2023, the Harvard Business Review (HBR) published an article by Tom Davenport and Maryam Alavi called “How to Train Generative AI Using Your Company’s Data.” The article looks at three ways to train an LLM:

  1. Training from scratch
  2. Fine-tuning an existing LLM
  3. Prompt-tuning an existing LLM

Davenport and Alavi explain that prompt-tuning an existing LLM is likely to be the most common way companies will train an LLM.

Training from scratch is an enormous undertaking that is unnecessary for most companies, even if they have the resources to do it. Fine-tuning an existing LLM is not quite as resource-intensive as training from scratch. However, this method likewise requires massive amounts of data and “considerable data science expertise,” according to the authors.

Prompt-tuning leaves the original LLM unaltered. Instead, the behavior of the model is shaped by prompts that carry relevant content into its context window. The prompts enable the LLM to answer questions related to that content and to calibrate the results.

The method you use to train an LLM is important. But more important is the content that you use to do the training. In order to produce accurate results, the LLM needs to be trained with content that is “accurate, timely, and not duplicated,” according to Davenport and Alavi.
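
To make “timely, and not duplicated” concrete, here is a minimal sketch of a curation pass in Python. The Doc record, its field names, and the one-year freshness cutoff are all assumptions made for illustration; they do not come from the HBR article. Accuracy cannot be checked this way and still needs a human reviewer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    """A hypothetical content item; the field names are illustrative."""
    doc_id: str
    text: str
    last_reviewed: date


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def curate(docs: list[Doc], today: date, max_age_days: int = 365) -> list[Doc]:
    """Drop stale items, then keep only the freshest copy of each text."""
    kept: list[Doc] = []
    seen: set[str] = set()
    # Newest first, so the most recently reviewed duplicate is the survivor.
    for doc in sorted(docs, key=lambda d: d.last_reviewed, reverse=True):
        if (today - doc.last_reviewed).days > max_age_days:
            continue  # not timely (assumed cutoff)
        key = normalize(doc.text)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(doc)
    return kept
```

This only catches copies that match after normalization. Near-duplicates that say the same thing in different words would need fuzzier matching, which is one more reason small, well-scoped components are easier to curate than long documents.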

Structured Content Is Essential to Training Your LLM

If you are using prompt-tuning to improve your LLM, “unstructured data like text is likely to be too large with too many important attributes to enter it directly in the context window for the LLM,” say Davenport and Alavi.

Structured content made up of small components that can be tagged at a granular level is undoubtedly one of the best forms of content to use to train any type of AI system. In fact, structured content and data are the key regardless of the training method.
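
As a rough illustration of why small, tagged components suit a limited context window, the Python sketch below picks only the components whose tags match a request and stops adding them once a size budget is reached. The Component record, the tag strings, and the prompt wording are hypothetical, and the word count is only a crude stand-in for the token limit a real LLM would impose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A small unit of content with granular tags (illustrative shape)."""
    component_id: str
    text: str
    tags: frozenset[str]


def select_components(components, required_tags, max_words):
    """Return components carrying every required tag that fit the word budget."""
    selected = []
    used = 0
    for comp in components:
        if not required_tags <= comp.tags:
            continue  # missing at least one required tag
        words = len(comp.text.split())
        if used + words > max_words:
            continue  # would overflow the assumed budget
        selected.append(comp)
        used += words
    return selected


def build_prompt(question, components):
    """Assemble a prompt that carries the selected content as context."""
    context = "\n".join(f"[{c.component_id}] {c.text}" for c in components)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A monolithic document offers no comparable handle: it either fits in the window whole or has to be cut up at arbitrary points, with no tags to say which piece answers which question.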

Davenport and Alavi cite the experiences of Morgan Stanley and Morningstar, stating:

“Morgan Stanley has also found that it is much easier to maintain high-quality knowledge if content authors are aware of how to create effective documents.”

Knowing how to write the documents and how to tag them are both important. The authors continue:

“Both Morgan Stanley and Morningstar trained content creators in particular on how best to create and tag content, and what types of content are well-suited to generative AI usage.”

Use Structured Content to Train Your AI Properly

When an LLM is trained using huge quantities of uncurated monolithic documents, it is highly likely that the training content contains contradictions and errors. The more this content is used for training, the more likely it is to be problematic.

When the LLM searches the content to prepare a response to a prompt, it uses whatever content it finds. The LLM cannot pick and choose which piece of contradictory content is accurate. It bases its response on what it finds and decides is the best response.
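
Because the LLM cannot referee contradictions on its own, it helps to surface them before training. The Python sketch below assumes your content has already been broken into simple (subject, attribute, value, source) records, which is itself an assumption that structured content makes far more realistic. The record shape and the sample values in the usage note are invented for illustration.

```python
from collections import defaultdict


def find_conflicts(facts):
    """Report each (subject, attribute) pair that has more than one value.

    `facts` is an iterable of (subject, attribute, value, source) tuples.
    The result maps each conflicting pair to {value: [sources]}.
    """
    values_by_key = defaultdict(lambda: defaultdict(list))
    for subject, attribute, value, source in facts:
        values_by_key[(subject, attribute)][value].append(source)
    return {
        key: dict(by_value)
        for key, by_value in values_by_key.items()
        if len(by_value) > 1  # more than one distinct value is a conflict
    }
```

For example, if one source gives a product warranty as "1 year" and another gives "2 years", the pair is flagged along with both sources, so a person can decide which statement is correct before the content goes anywhere near the LLM.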

If you confuse the LLM enough by training it with lots of messy, redundant, contradictory content, the LLM is happy to hallucinate. This means it simply makes up responses. The responses are very well-written and grammatically accurate. But there is no guarantee that they are factually correct. They just look that way.

That’s why skipping the work of structuring and curating your content is at your own peril. Depending on how important accuracy is, you might get away with it. More likely, though, you should structure your content, curate it, and clean it up first. Then, you can use the content to train your AI and be confident in the results produced by the LLM.

Val Swisher