By Val Swisher and Becca Morn
A funny thing is happening lately in our conversations with customers. Ever since ChatGPT and its competitors made their way into the public’s consciousness, customers have been thinking, “If we wait long enough, these AI-powered search engines will fix all of our woes.”
Wouldn’t it be great if that were the case? If somehow, automagically, all of the messy, poorly organized, inconsistent content would be fixed by a new technology without us having to do anything about it. Even better – we can keep writing crappy content in the same way as we always have, and these technologies will make it beautiful, readable, consistent, and easy to find all without us having to even click a mouse.
Indeed, magical thinking is just that – magical.
READ MORE: The secret to effective content is easier than you think
You Can’t Train an AI Engine With Crappy Content
And if you do, you shouldn’t expect anything other than crappy results.
Let’s start with Large Language Models (LLMs) such as the variety of GPTs, ERNIE, BERT (I am not making that up), Llama, LaMDA, and the rest. Once trained, these LLMs do not keep learning on their own. They know only what you teach them.
The old adage, garbage in/garbage out, applies just as well to LLMs as it does to any other search capability.
In response to the limitations of LLMs, many companies are now using Retrieval-Augmented Generation (RAG) engines. RAG improves the results of LLMs by tapping into external sources of data to augment the information in the LLM. In theory, the results of RAG queries are less likely to contain mistakes or hallucinations, where the AI makes up information seemingly out of nowhere.
Adding RAG to an AI solution is a great way to provide an outside system of checks and balances. However (there is always a “however”), the content that RAG retrieves also has to be clean and accurate.
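To make the retrieve-then-augment flow concrete, here is a minimal sketch of the RAG pattern in Python. It is a toy, not a production system: the `embed` function is a stand-in bag-of-words representation (a real pipeline would use a trained embedding model), and the knowledge-base snippets are invented examples.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: lowercase word counts. A real RAG system would call a
    # trained embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank the knowledge-base chunks by similarity to the query; keep top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    # Augment the prompt with retrieved context before sending it to the LLM.
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "To reset your password, open Settings and choose Account.",
    "The warranty covers hardware defects for two years.",
    "Invoices are emailed on the first business day of each month.",
]
print(build_prompt("How do I reset my password?", kb))
```

The point of the sketch is the shape of the pipeline: if the chunks in `kb` are messy or contradictory, the retrieval step faithfully delivers that mess straight into the prompt.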
According to Sthanikam Santhosh, in his Medium article “How to improve RAG (Retrieval Augmented Generation),” there are a number of steps you should take to enhance the quality and accuracy of the results, including:
- Clean and organize the content.
  - Remove extra information, such as unneeded metadata, text, and special characters.
  - Consolidate or deduplicate information to create a single source of truth.
  - Remove inconsistent and redundant information.
- Chunk the content.
  - Each chunk should contain information that answers a single user question.
  - Each chunk should contain all of the information needed to answer that question.
  - Use the “right size” chunk for your content and needs.
- Choose your embedding models carefully, particularly where specific terminology will make the results more accurate.
  - The RAG engine compares the meaning of sentences.
  - By embedding terminology and using it consistently, you make the results more accurate.
- Pay attention to, and spend time tuning, your prompts.
  - Fun fact: LLMs focus more on text at the beginning and end of a prompt.
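The clean, deduplicate, and chunk steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `max_chars` is an arbitrary example size, the regex cleanup only handles leftover HTML tags and stray whitespace, and the sample paragraphs are invented.

```python
import re

def clean(text):
    # Strip leftover HTML tags and collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(paragraphs):
    # Keep only the first occurrence of each paragraph: a single source of truth.
    seen, unique = set(), []
    for p in paragraphs:
        key = p.lower()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

def chunk(paragraphs, max_chars=120):
    # Pack whole paragraphs into chunks no larger than max_chars, so a chunk
    # stays a self-contained unit instead of splitting mid-sentence.
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 1 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current} {p}".strip()
    if current:
        chunks.append(current)
    return chunks

raw = [
    "<p>Reset your password   in Settings.</p>",
    "Reset your password in Settings.",  # duplicate once cleaned
    "The warranty covers hardware defects for two years.",
    "Invoices are emailed on the first business day of each month.",
]
cleaned = dedupe([clean(p) for p in raw])
for c in chunk(cleaned):
    print(c)
```

The “right size” for `max_chars` depends on your embedding model and how much context a single user question really needs; it is something to tune, not a constant to copy.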
Adding RAG systems to your LLM environment helps improve the accuracy of results to queries. However, this doesn’t mean that you can ignore the content that you use for LLM training. The cleaner the training content is, the better overall results you will get.
READ MORE: Do this one thing to improve your AI performance
Tasks to Consider When Making the Move to an LLM
If you are making the move to an LLM with or without the addition of RAG, consider these tasks:
- Tag all searchable content according to its type and context. Depending on finding specific words in titles or in the text is not enough, especially if the same terms are not used consistently.
- A large index database of search-term synonyms is required to support NLP, and it requires ongoing, continuous maintenance and updating.
- Organize content using a consistent and logical structure, such as headings, sections, and metadata. Structured content allows the NLP search engine to understand the relationships between different content elements, improving the accuracy and relevance of the search results.
- Write content using clear, concise, unambiguous, and grammatically correct language. Avoid jargon, idioms, and complex sentence structures. (Keep in mind that AI systems do not know how to joke around.)
- Use appropriate semantic markup, metadata, and tagging on content elements. This provides essential context and meaning to the content, helping the NLP search engine to better understand and index the content.
- Use consistent terminology across the entire body of content. This reduces ambiguity and helps the NLP search engine identify relevant content more effectively. (It also makes the content more usable.)
- Fine-tune AI models using your company corpus and any additional validated resources you can use to improve the understanding of your content context and increase the accuracy of AI chatbot responses. Your RAG will thank you.
- Use a human-in-the-loop approach, where AI-generated responses are reviewed and validated by subject matter experts before being presented to users, to help ensure the accuracy and reliability of the information provided.
- Allow users to provide feedback on AI-generated responses. This feedback can be used to further fine-tune the model, helping it learn from its mistakes and improve its performance over time.
- AI models can often provide a confidence score for their generated responses. Set a minimum confidence threshold to filter out low-confidence answers that might be incorrect or irrelevant.
- Ensure that the AI model is capable of understanding user context and intent to provide more relevant and accurate search results or answers.
- Regularly monitor the AI model’s performance and make necessary updates or adjustments to ensure it continues to provide accurate and reliable information.
- Always establish guidelines and policies to address ethical concerns, such as privacy, security, and data usage, particularly when implementing AI models in medical settings.
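Two of the tasks above, confidence thresholds and user feedback, are easy to picture in code. Here is a minimal sketch; the threshold value, function names, and feedback fields are all illustrative assumptions, not a real API.

```python
# Assumed cutoff; in practice you would tune this against an evaluation set.
CONFIDENCE_THRESHOLD = 0.75

def route_response(answer, confidence, threshold=CONFIDENCE_THRESHOLD):
    # Human-in-the-loop routing: high-confidence answers go to the user,
    # everything else goes to a subject matter expert review queue.
    if confidence >= threshold:
        return ("user", answer)
    return ("sme_review", answer)

feedback_log = []

def record_feedback(question, answer, helpful):
    # Capture user feedback so poorly rated answers can inform later
    # fine-tuning and content fixes.
    feedback_log.append(
        {"question": question, "answer": answer, "helpful": helpful}
    )

channel, _ = route_response("Open Settings > Account to reset it.", confidence=0.91)
print(channel)  # prints "user"
channel, _ = route_response("Maybe try turning it off?", confidence=0.40)
print(channel)  # prints "sme_review"
```

The design choice worth noting is that a low-confidence answer is routed, not discarded: the SME review queue doubles as a signal about where your content (or your model) needs work.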
With or without the addition of RAG, for an NLP-enabled or LLM-enabled search engine to be effective, the body of content it operates on needs to be well organized, consistent, and marked up with appropriate tags.
While we’d love to say that artificial intelligence is the panacea for all of our content ailments, it simply is not. The best thing you can do today to prepare for the AI of tomorrow is to clean up your content, organize it, and structure it so that people and machines have an easier time finding what they need and understanding what they find.
Learn more about how to best transition your content to AI engines. Subscribe to the Content Rules newsletter and get expert tips right in your inbox.