Artificial intelligence (AI) used to be something we thought about as a future need or threat. Just a couple of years ago, the impact of AI on our daily lives was indirect. Companies employed AI systems that ran behind a user interface. Even though we have interacted with AI for several years, we did not necessarily know when AI was in the background.
But all that changed when OpenAI first launched ChatGPT to the public back in 2022. Since then, AI has gone from “the big, exciting, and scary future” to “AI is in my pocket.”
Generative AI and Large Language Models Defined
GenAI and large language models (LLMs) have been making big waves recently, yet many companies are still trying to understand their implications. If you’re not familiar, here’s a quick primer:
GenAI is a class of AI applications that learn patterns from existing data and use that learning to create new, original content. An LLM is a type of GenAI that uses natural language understanding (NLU) to discern and generate human-like text, drawing on enormous amounts of training content to instruct the NLU. While LLMs focus on generating text, GenAI encompasses a much wider range of applications, including images, audio, video, and more.
READ MORE: Preparing content for AI – 6 reasons why you’re not ready
Is AI the One-Stop Solution?
From the outside, LLMs and GenAI look like the one-stop solution to all the problems that have plagued content creators and content seekers. If we can have LLMs create the content we need when we need it, have we reached content nirvana? I talk to people all day, every day, about GenAI and LLMs. They don’t know what to do or which way to go. They are confused and conflicted about the need for structured content. Their management is asking questions and those questions are going unanswered:
Should we move to GenAI and LLMs now and skip the structured content step completely? Won’t that save us time and money? What could possibly go wrong?
What Could Possibly Go Wrong?
On July 6, 2023, the Harvard Business Review (HBR) published an article by Tom Davenport and Maryam Alavi called, “How to Train Generative AI Using your Company’s Data.” The article looks at three ways to train an LLM:
- Training from scratch
- Fine-tuning an existing LLM
- Prompt-tuning an existing LLM
Davenport and Alavi explain that prompt-tuning an existing LLM is likely to be the most common way companies will train an LLM.
Training from scratch is an enormous undertaking that is unnecessary for most companies, even if they have the resources to do it. Fine-tuning an existing LLM is not quite as resource-intensive as training from scratch. However, this method likewise requires massive amounts of data and “considerable data science expertise,” according to the authors.
Prompt-tuning leaves the original content in the LLM unaltered. Instead, the LLM is adapted through prompts that supply the relevant content and instructions. The prompts enable the LLM to answer questions related to that content and to calibrate its results.
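To make the idea concrete, here is a minimal sketch of prompt-tuning in the sense Davenport and Alavi describe: the model’s weights stay untouched, and your own curated content is placed in the prompt (the context window) along with instructions to answer only from that content. The function name, prompt wording, and sample content below are illustrative assumptions, not any specific vendor’s API.

```python
# Sketch: build a context-window prompt from curated content chunks,
# leaving the underlying LLM unaltered. The prompt format here is an
# assumption for illustration.

def build_prompt(question: str, content_chunks: list[str]) -> str:
    """Assemble a prompt that embeds curated content as numbered sources."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}"
        for i, chunk in enumerate(content_chunks)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical product content, standing in for a company's curated chunks.
chunks = [
    "The X100 router supports firmware versions 2.1 and later.",
    "Firmware updates for the X100 are applied through the admin console.",
]
prompt = build_prompt("How do I update the X100 firmware?", chunks)
print(prompt)
```

The resulting string would then be sent to whatever LLM you use; the key point is that the knowledge travels in the prompt, not in retrained weights.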
The method you use to train an LLM is important. But more important is the content that you use to do the training. In order to produce accurate results, the LLM needs to be trained with content that is “accurate, timely, and not duplicated,” according to Davenport and Alavi.
READ MORE: Try this if you’re attempting to improve your AI performance
Structured Content Is Essential to Training Your LLM
If you are using prompt-tuning to improve your LLM, “unstructured data like text is likely to be too large with too many important attributes to enter it directly in the context window for the LLM,” say Davenport and Alavi.
Structured content made up of small components that can be tagged at a granular level is undoubtedly one of the best forms of content to use to train any type of AI system. In fact, structured content and data are the key regardless of the training method.
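Here is a small illustrative sketch of why granular tagging helps: when content lives as small, individually tagged components rather than monolithic documents, you can select exactly the current, approved pieces for AI training. The field names and sample records are assumptions, not a particular CMS schema.

```python
# Sketch: structured content as small, tagged components. Granular
# metadata (all field names are illustrative) lets a training pipeline
# filter out drafts, duplicates, or off-topic material.

components = [
    {"id": "c1", "type": "procedure", "product": "X100",
     "status": "approved", "text": "Open the admin console to begin."},
    {"id": "c2", "type": "procedure", "product": "X100",
     "status": "draft", "text": "Draft steps; do not publish."},
    {"id": "c3", "type": "procedure", "product": "Z200",
     "status": "approved", "text": "Steps for a different product."},
]

def select_for_training(comps: list[dict], product: str) -> list[str]:
    """Keep only approved components tagged for the given product."""
    return [c["text"] for c in comps
            if c["product"] == product and c["status"] == "approved"]

training_chunks = select_for_training(components, "X100")
print(training_chunks)
```

With monolithic documents, the draft and the approved procedure would travel together; with tagged components, the filter is one line.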
Davenport and Alavi cite the experiences of Morgan Stanley and Morningstar, stating:
“Morgan Stanley has also found that it is much easier to maintain high-quality knowledge if content authors are aware of how to create effective documents.”
Both knowing how to write the documents and how to tag the documents are important. The authors continue:
“Both Morgan Stanley and Morningstar trained content creators in particular on how best to create and tag content, and what types of content are well-suited to generative AI usage.”
Use Structured Content to Train Your AI Properly
When an LLM is trained using huge quantities of uncurated monolithic documents, it is highly likely that the training content contains contradictions and errors. The more this content is used for training, the more likely it is to be problematic.
When the LLM searches the content to prepare a response to a prompt, it uses whatever content it finds. The LLM cannot pick and choose which piece of contradictory content is accurate. It simply bases its response on whatever it finds and deems best.
If you confuse the LLM enough by training it with lots of messy, redundant, contradictory content, the LLM is happy to hallucinate. This means it simply makes up responses. The responses are very well-written and grammatically accurate. But there is no guarantee that they are factually correct. They just look that way.
That’s why skipping the structuring and curation of your content is done at your own peril. Depending on how important accuracy is, you might get away with it. More likely, though, you should structure your content, curate it, and clean it up first. Then, you can use the content to train your AI and be confident in the results produced by the LLM.
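One of the simplest cleanup steps the article implies is removing duplicated content before it reaches the model. The sketch below deduplicates chunks by a normalized hash; the normalization (lowercasing, collapsing whitespace) is a deliberate simplification, and real pipelines often use fuzzier matching.

```python
import hashlib

# Sketch: one curation step, removing duplicate content chunks before
# training. Normalization here is a simple assumption for illustration.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())

def deduplicate(chunks: list[str]) -> list[str]:
    """Return chunks with exact (normalized) duplicates removed."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

docs = [
    "Restart the device to apply updates.",
    "Restart the device  to apply updates.",  # duplicate, extra spacing
    "Back up your settings before updating.",
]
print(deduplicate(docs))
```

Detecting outright contradictions is much harder than detecting duplicates, which is one more reason human curation and structure matter before training.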