Moving to AI can be challenging, but training your large language models with structured content can produce more accurate results.

By Val Swisher and Becca Morn 

A funny thing is happening lately in our conversations with customers. Ever since ChatGPT and its competitors made their way into the public’s consciousness, customers have been thinking, “If we wait long enough, these AI-powered search engines will fix all of our woes.”

Wouldn’t it be great if that were the case? If somehow, automagically, all of the messy, poorly organized, inconsistent content would be fixed by a new technology without us having to do anything about it. Even better: we can keep writing crappy content the same way we always have, and these technologies will make it beautiful, readable, consistent, and easy to find, all without us even having to click a mouse.

Indeed, magical thinking is just that – magical.


You Can’t Train an AI Engine With Crappy Content

And if you do, you shouldn’t expect anything other than crappy results.

Let’s start with Large Language Models (LLMs) such as the variety of GPTs, ERNIE, BERT (I am not making that up), Llama, LaMDA, and the rest. These LLMs do not keep learning on their own after they are trained. They know only what you teach them.

The old adage, garbage in/garbage out, applies just as well to LLMs as it does to any other search capability.

In response to the limitations of LLMs, many companies are now using Retrieval-Augmented Generation (RAG) engines. RAG improves the results of LLMs by tapping into external sources of data to augment the information in the LLM. In theory, the results of RAG queries are less likely to contain mistakes or hallucinations, where the AI makes up information seemingly out of nowhere.

The addition of a RAG to an AI solution is a great way to provide an outside system of checks and balances. However (there is always a “however”), the content that the RAG retrieves also has to be clean and accurate.
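
To make the retrieve-then-generate idea concrete, here is a minimal sketch in Python. It is not any particular vendor’s pipeline: embed() and call_llm() are placeholders for whatever embedding model and LLM endpoint you actually use, and the toy bag-of-words vectors only stand in for real embeddings. The point is the shape of the flow: find the chunks of your own content that best match the question, then hand them to the model along with the question.

```python
# Minimal retrieve-then-generate (RAG) sketch. embed() and call_llm() are
# stand-ins for a real embedding model and a real LLM API, not actual library calls.
import math

def embed(text):
    # Toy bag-of-words vector; a real system would use an embedding model here.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=3):
    # Rank stored content chunks by how closely their meaning matches the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def call_llm(prompt):
    # Placeholder for whatever LLM endpoint you use.
    return "(model response would appear here)"

def answer(question, chunks):
    context = "\n\n".join(retrieve(question, chunks))
    prompt = f"Answer using only the content below.\n\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```

Notice that everything the model sees comes from the retrieved chunks. If those chunks are messy, duplicated, or contradictory, the generated answer inherits the mess.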

According to Sthanikam Santhosh, in his Medium article “How to improve RAG (Retrieval Augmented Generation),” there are a number of steps you should take to enhance the quality and accuracy of the results, including:

  • Clean and organize the content.
      • Remove extra information, such as metadata, text, and special characters that are not needed.
      • Consolidate or deduplicate information to create a single source of truth.
      • Remove inconsistent and redundant information.
  • Chunk the content (see the chunking sketch after this list).
      • Each chunk should contain information that would answer a single user question.
      • Each chunk should contain all of the information needed to answer that question.
      • Use the “right size” chunk for your content and needs.
  • Choose and tune your embedding models carefully, particularly where specific terminology will make the results more accurate.
      • The RAG compares the meaning of sentences.
      • By embedding your terminology and using it consistently, the results will be more accurate.
  • Pay attention to and spend time tuning prompts.
      • Fun fact: LLMs focus more on text at the beginning and end of a prompt.
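
Here is a rough sketch of the chunking advice above. Splitting on “## ” headings and the 40/1,200-character size bounds are illustrative assumptions, not rules; the idea is that each chunk covers one topic, exact duplicates are dropped, and anything too small or too large to answer a single question cleanly gets filtered out for review.

```python
# Sketch: chunk content so each piece answers one question and stays a sensible size.
# Splitting on "## " headings and the 40/1200-character bounds are illustrative choices.

def chunk_by_heading(text):
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:   # a new topic starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

def clean_chunks(chunks, min_len=40, max_len=1200):
    seen, keep = set(), []
    for chunk in chunks:
        if chunk in seen:                        # deduplicate: single source of truth
            continue
        seen.add(chunk)
        if min_len <= len(chunk) <= max_len:     # "right size" for one question
            keep.append(chunk)
    return keep
```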

Adding RAG systems to your LLM environment helps improve the accuracy of results to queries. However, this doesn’t mean that you can ignore the content that you use for LLM training. The cleaner the training content is, the better overall results you will get.


Tasks to Consider When Making the Move to an LLM

If you are making the move to an LLM with or without the addition of RAG, consider these tasks:

  • Tag all searchable content according to its type and context. Relying on specific words appearing in titles or body text is not enough, especially if the same terms are not used consistently.
  • A large database of search-term synonyms is required to support natural language processing (NLP), and it needs ongoing maintenance and updating.
  • Organize content using a consistent and logical structure, such as headings, sections, and metadata. Structured content allows the NLP search engine to understand the relationships between different content elements, improving the accuracy and relevance of the search results.
  • Write content using clear, concise, unambiguous, and grammatically correct language. Avoid jargon, idioms, and complex sentence structures. (Keep in mind that AI systems do not know how to joke around.)
  • Use appropriate semantic markup, metadata, and tagging on content elements. This provides essential context and meaning to the content, helping the NLP search engine to better understand and index the content. (See the tagging sketch after this list.)
  • Use consistent terminology across the entire body of content. This reduces ambiguity and helps the NLP search engine identify relevant content more effectively. (It also makes the content more usable.)
  • Fine-tune AI models using your company corpus and any additional validated resources you can, to improve their understanding of your content’s context and increase the accuracy of AI chatbot responses. Your RAG will thank you.
  • Use a human-in-the-loop approach, where AI-generated responses are reviewed and validated by subject matter experts before being presented to users, to help ensure the accuracy and reliability of the information provided.
  • Allow users to provide feedback on AI-generated responses. This feedback can be used to further fine-tune the model, helping it learn from its mistakes and improve its performance over time.
  • AI models can often provide a confidence score for their generated responses. Set a minimum confidence threshold to filter out low-confidence answers that might be incorrect or irrelevant. (See the threshold sketch after this list.)
  • Ensure that the AI model is capable of understanding user context and intent to provide more relevant and accurate search results or answers.
  • Regularly monitor the AI model’s performance and make necessary updates or adjustments to ensure it continues to provide accurate and reliable information.
  • Always establish guidelines and policies to address ethical concerns, such as privacy, security, and data usage, particularly when implementing AI models in medical settings.
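
As an illustration of what tagging content by type and context can look like, here is a small sketch. The field names, such as content_type, product, and audience, are invented for the example; use whatever taxonomy your own content model defines.

```python
# Illustrative only: a chunk of content carrying type and context metadata.
# Field names (content_type, product, audience, version) are example choices,
# not a standard schema.
chunk = {
    "id": "install-widget-001",
    "content_type": "procedure",      # what kind of content this is
    "product": "ExampleWidget",
    "audience": "administrator",
    "version": "2.4",
    "title": "Install the widget",
    "body": "1. Download the installer. 2. Run setup with admin rights.",
}

def matches(chunk, **filters):
    """Return True if the chunk's metadata satisfies every filter."""
    return all(chunk.get(key) == value for key, value in filters.items())

# A retriever can narrow by context before ranking by meaning:
relevant = matches(chunk, content_type="procedure", audience="administrator")
```
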

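If your model or pipeline reports a confidence score with each answer, a simple gate like the one below keeps low-confidence responses away from users. The 0.7 threshold and the fallback wording are assumptions to tune for your own system.

```python
# Sketch: suppress low-confidence answers instead of showing them to users.
# The threshold value is an example; tune it against your own evaluation data.
CONFIDENCE_THRESHOLD = 0.7

def present_answer(answer, confidence):
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    # Fall back to a safe response (or route to a human reviewer) when unsure.
    return "I'm not confident enough to answer that. Please contact support."

print(present_answer("Restart the service to apply the change.", confidence=0.91))
print(present_answer("Maybe reinstall everything?", confidence=0.35))
```
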
With or without the addition of RAG, for an NLP-enabled or LLM-enabled search engine to be effective, the body of content it operates on needs to be well organized, consistent, and marked up with the appropriate tags and metadata.

While we’d love to say that artificial intelligence is the panacea for all of our content ailments, it simply is not. The best thing you can do today to prepare for AI tomorrow is to clean up your content, organize it, and structure it so that people and machines have an easier time finding what they need and understanding what they find.

Learn more about how to best transition your content to AI engines. Subscribe to the Content Rules newsletter and get expert tips right in your inbox. 
