How to Improve AI Performance? Do This One Thing to Your Content

by Val Swisher | May 20, 2024

Recently, I have been doing a lot of research on Retrieval-Augmented Generation (RAG). RAG combines retrieval-based and generative artificial intelligence to produce results. Essentially, it retrieves content that is relevant to your query and then uses what it finds to generate an answer.

I came across a number of articles that discussed the importance of chunking content to improve RAG performance. The articles were written by engineers who specialize in AI for large corporations. They know their stuff.

The articles all said essentially the same thing: if you want to improve the performance of your RAG system (and consequently the answers you get from your LLM), you need to chunk your content. Chunking makes the content easier for the AI to process. It also makes it easier for the AI to know what information belongs together by meaning.

What came next threw me. Many of the articles went on to include code that you can use to have an AI engine chunk long-form, already-published content. Now, this sort of makes sense. If you can chunk existing output programmatically, the software saves you the effort of going back to the source content and chunking it yourself.

To chunk the content, the AI needs to find patterns. To my amusement, one of the methods suggested was to use fonts and font sizes. From a tech doc perspective, it is the equivalent of looking at a Heading 1 and then chunking its content into all of the associated Heading 2s. You can chunk even further if you have multiple Heading 3s. Rather than using styles, the code examines the font and font size to determine the heading level.

I say “to my amusement,” because those of us who have been componentizing content for years understand that relying on fonts to determine where to chunk content is not a bulletproof method for success. In fact, it’s almost guaranteed that your content components will not be discrete units of information that can stand alone. They cannot be considered a single source of truth. This content will still reflect the linear, monolithic document that it came from.

Using the output as the source material to chunk content is bulky and cumbersome.
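To make the font-based approach described above concrete, here is a minimal sketch of the idea. It is not taken from any of the articles I read. The `Line` records, the sample text, and the rule "the most common font size is body text, anything larger is a heading" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    text: str
    font_size: float  # point size, as a hypothetical PDF/HTML extractor might report it


@dataclass
class Chunk:
    heading: str
    level: int
    body: list[str] = field(default_factory=list)


def chunk_by_font_size(lines: list[Line]) -> list[Chunk]:
    """Guess heading levels from font size alone (assumed heuristic)."""
    sizes = [ln.font_size for ln in lines]
    # Assume the most frequently used size is ordinary body text.
    body_size = max(set(sizes), key=sizes.count)
    # Rank the larger sizes: the largest becomes level 1, the next level 2, and so on.
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}

    chunks: list[Chunk] = []
    for ln in lines:
        if ln.font_size in level_of:
            # Any line in a "heading" size starts a new chunk.
            chunks.append(Chunk(heading=ln.text, level=level_of[ln.font_size]))
        elif chunks:
            # Body text is attached to the most recent heading.
            # Text that appears before the first heading is dropped.
            chunks[-1].body.append(ln.text)
    return chunks


# Hypothetical published output: the warning is styled large, but it is not a heading.
sample = [
    Line("Installing the Pump", 24),
    Line("Before You Begin", 18),
    Line("Gather the tools listed in the kit.", 11),
    Line("WARNING: Disconnect power first.", 18),
    Line("Attach the hose to the inlet.", 11),
    Line("Tighten the clamp.", 11),
]

chunks = chunk_by_font_size(sample)
for c in chunks:
    print(c.level, c.heading, c.body)
```

In this sample, the warning is formatted in the same size as a second-level heading, so the heuristic treats it as one. The two installation steps that follow end up in a chunk titled "WARNING: Disconnect power first." The font tells the code how the text looks, not what it means.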

Why Not Create the Content in Components?

This way, you can train the AI with content that is already chunked and rich with semantic markup and metadata that provide information to the system. For example, you can base your chunking on meaningful criteria such as product, information type (Is it instructions? Is it a safety warning?), and audience level (Novice? Expert? Student? Teacher?).

According to the AI experts, rich metadata helps the RAG return more accurate results. Thus, the metadata you apply to help humans find and use the content also helps the AI find and use the content. It’s a win-win.
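As a rough illustration of how metadata can narrow what a retrieval step considers, here is a small sketch. Everything in it is an assumption made for the example: the `Component` fields, the product names, and the sample content are hypothetical, and the word-overlap score is a toy stand-in for the embedding-based search that a real RAG system would typically use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One standalone unit of content plus the metadata authored with it."""
    title: str
    body: str
    product: str
    info_type: str  # for example "instructions" or "safety_warning"
    audience: str   # for example "novice" or "expert"


def filter_components(components, **criteria):
    """Keep only the components whose metadata matches every criterion."""
    return [
        c for c in components
        if all(getattr(c, key) == value for key, value in criteria.items())
    ]


def rank_by_overlap(components, query):
    """Toy relevance score: number of query words found in the title or body."""
    words = set(query.lower().split())

    def score(c):
        return len(words & set(f"{c.title} {c.body}".lower().split()))

    return sorted(components, key=score, reverse=True)


# Hypothetical component library.
library = [
    Component("Replace the filter", "Open the housing and swap the filter cartridge.",
              "PumpX", "instructions", "novice"),
    Component("Filter safety", "Disconnect power before you open the housing.",
              "PumpX", "safety_warning", "novice"),
    Component("Calibrate flow rate", "Adjust the flow sensor offset in service mode.",
              "PumpX", "instructions", "expert"),
    Component("Replace the filter", "Unscrew the lid and swap the filter.",
              "PumpY", "instructions", "novice"),
]

# Narrow by metadata first, then rank what is left against the query.
candidates = filter_components(library, product="PumpX", audience="novice")
results = rank_by_overlap(candidates, "how do I replace the filter")
for c in results:
    print(c.product, c.info_type, c.title)
```

Because each component already carries its product, information type, and audience, the system never has to guess whether the PumpY procedure or the expert-level calibration topic belongs in an answer for a novice PumpX user. The metadata rules them out before any ranking happens.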

Think about it. If you structure your content first, and then use that structured content to train your AI, you are several steps ahead in terms of RAG performance. If you further optimize the content, such as by single-sourcing, using consistent terminology, and eliminating inconsistencies, the quality of your results will continue to improve.

Do not underestimate the value of structured content to the performance of AI. The two are inextricably linked. If you are moving to AI, structure your content first.


Author

Val Swisher

Val Swisher is the Founder and CEO of Content Rules, Inc. Val enjoys helping companies solve complex content problems. She is a well-known expert in content strategy, structured authoring, global content, content development, and terminology management. Val believes content should be easy to read, cost-effective to create and translate, and efficient to manage. Her customers include industry giants such as Google, Cisco, Visa, Facebook, Roche, and IBM. Her fourth book, “The Personalization Paradox: Why Companies Fail (and How to Succeed) at Creating Personalized Experiences at Scale,” was published in 2021 by XML Press.

Val is on the Advisory Board for the Technical Communications Program at the University of North Texas. When not working with customers or students, Val can be found sitting behind her sewing machine working on her latest quilt. She also makes a mean hummus.
