Recently, I have been doing a bunch of research on Retrieval Augmented Generation (RAG). RAG combines retrieval-based and generative AI to produce results. Essentially, it blends what it finds with what it creates and gives you an answer to your query.
I came across a number of articles that discussed the importance of chunking content to improve RAG performance. The articles were written by engineers who specialize in AI for large corporations. They know their stuff.
The articles all basically said the same thing: if you want to improve the performance of your RAG (and consequently your LLM), you need to chunk your content. Chunking makes the content easier for the AI to process, and it tells the AI which pieces of information belong together by meaning.
What came next threw me. Many of the articles went on to include code you can use to have an AI engine chunk longform, already-published content. This sort of makes sense: if you can programmatically chunk the existing output, the software saves you the effort of going back and chunking the source itself.
In order to chunk the content, the AI needs to find patterns. To my amusement, one of the suggested methods was to use fonts and font sizes. From a tech doc perspective, it is the equivalent of looking at a Heading 1 and chunking the content under its associated Heading 2s, chunking further wherever there are Heading 3s. Rather than using styles, the code examines the font and font size to determine the heading level.
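To make the approach concrete, here is a minimal sketch of what font-size-based chunking looks like. The span data and the size thresholds are hypothetical; in practice the (text, font size) pairs would come from a PDF or HTML extraction library, and the thresholds would have to be guessed per document, which is exactly where the method gets fragile.

```python
def chunk_by_font_size(spans, heading_size=18):
    """Group (text, font_size) spans into chunks, starting a new
    chunk whenever a span's font size looks like a heading.

    heading_size is an assumed threshold, not a real standard --
    any document that styles headings differently breaks it.
    """
    chunks = []
    current = None
    for text, size in spans:
        if size >= heading_size:          # treat large text as a heading
            if current:
                chunks.append(current)
            current = {"heading": text, "body": []}
        elif current is not None:         # body text under the last heading
            current["body"].append(text)
    if current:
        chunks.append(current)
    return chunks

# Illustrative spans, as an extractor might report them.
spans = [
    ("Installation", 24),
    ("Download the installer.", 11),
    ("Requirements", 18),
    ("You need 4 GB of RAM.", 11),
]
print(chunk_by_font_size(spans))
```

Note that the chunk boundaries here come entirely from visual styling, not from meaning, which is the weakness discussed next.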
I say “to my amusement,” because those of us who have been componentizing content for years understand that relying on fonts to determine where to chunk content is not a bulletproof method for success. In fact, it’s almost guaranteed that your content components will not be discrete units of information that can stand alone. They cannot be considered a single source of truth. This content will still reflect the linear, monolithic document that it came from.
Using the output as the source material to chunk content is bulky and cumbersome.
Why Not Create the Content in Components?
If you create the content in components from the start, you can train the AI on content that is already chunked and rich with the semantic markup and metadata that inform the system. For example, you can base your chunking on meaningful criteria such as product, information type (instructions? a safety warning?), and audience level (novice? expert? student? teacher?).
According to the AI experts, rich metadata helps the RAG return more accurate results. Thus, the metadata you apply to help humans find and use the content also helps the AI find and use the content. It’s a win-win.
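As a rough illustration of why that metadata helps, here is a minimal sketch of metadata-aware retrieval. The chunks, field names, and filter values are all illustrative, not a real product's API; the point is that metadata lets the system narrow the candidate pool before any semantic ranking runs.

```python
# Hypothetical component chunks, each carrying authored metadata.
chunks = [
    {"text": "Unplug the unit before opening the panel.",
     "product": "X100", "type": "safety", "audience": "novice"},
    {"text": "Calibrate the sensor array via the service menu.",
     "product": "X100", "type": "instructions", "audience": "expert"},
    {"text": "Unplug the unit before servicing.",
     "product": "X200", "type": "safety", "audience": "novice"},
]

def retrieve(chunks, **filters):
    """Return only the chunks whose metadata matches every filter.

    A real RAG pipeline would then rank this smaller pool by
    semantic similarity to the query.
    """
    return [c for c in chunks
            if all(c.get(key) == value for key, value in filters.items())]

for chunk in retrieve(chunks, product="X100", type="safety"):
    print(chunk["text"])
```

With the same filter applied to unstructured, font-chunked content, there is nothing to filter on: the metadata has to be authored into the components first.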
Think about it. If you structure your content first, and then use that structured content to train your AI, you are several steps ahead in terms of RAG performance. If you further optimize the content, such as by single-sourcing, using consistent terminology, and eliminating inconsistencies, the quality of your results will continue to improve.
Do not underestimate the value of structured content to the performance of AI. They are inextricably linked. If you are moving to AI, structure your content first.
Is your content ready for AI? Subscribe to our newsletter and get tips, tricks, and helpful advice from our content experts.