Consider this scenario: you’re tasked with building an LLM-powered assistant that lets people query your organisation’s documents. You know that RAG (retrieval-augmented generation) has been getting a lot of attention recently, and your team wants to try it too.
So you gather all the useful documents. Do you read every last word in every document? Probably not – that would defeat the purpose of using AI to help you rapidly access the nuggets of information within.
The next step is to chunk them. To be efficient, you need to do it automatically without knowing what’s inside.
That’s quite a challenge! How would you do it?
Every point is unique
Before we look at the different strategies you might have for chunking the data, we should first ask this question – how long does it take to make a point?
You know what we mean. Every good document (as well as other media such as videos and audio) leads you towards conclusions, and they take you there via supporting information.
But every author is different, every media type is different, and every document is produced for a different need.
Even the summaries and conclusions made within the document – such as a table of contents, index, bullet points or a conclusion at the end of a section – might not reference every point in the entire document.
When trying to decide how to divide that data into useful chunks, you could decide to separate each paragraph, page or another arbitrarily chosen length – but why? Some writers get to their point in a paragraph, and some get there in a page. It also depends on the subject being discussed. The menu for a company’s annual staff party is (hopefully) going to be more concise than their whitepapers!
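To see why an arbitrary length is a problem, here’s a minimal sketch of the naive approach: fixed-size chunks with a little overlap. The function name and parameters are illustrative, not from any particular library. Notice that the split points fall wherever the counter says, regardless of where a point actually ends.

```python
def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    The boundaries are arbitrary: a point that happens to span a
    boundary gets split across two chunks, which is exactly the
    weakness described above.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

The overlap softens the problem slightly (a point near a boundary appears whole in at least one chunk more often), but it can’t remove it, and it inflates the knowledge base with duplicated text.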
An adaptable approach
Smartbots have an innovative approach to this challenge.
They don’t define a strict chunk length and apply it to everything. If they did, the results would be inconsistent, because sometimes a nugget of information would be caught within a single chunk, and sometimes it wouldn’t. Sometimes it would be split across several chunks, forcing you to search harder and work your LLM more at the other end when summarising. The risk is that only part of a point is caught.
Instead, Smartbots use an automated approach that leads to more accurate results, and it’s actually pretty simple.
Here’s how they do it: they use an LLM to chunk the data. The LLM sifts through the documents and defines where each nugget of information starts and ends. A chunk could be short or it could be long, but the point is that it’s adaptable – which makes sense, because every document is different.
Once the LLM has chunked the data, they divide the documents up accordingly and add the chunks to the knowledge base at the heart of RAG. That way, LLMs are used at both the start and the end of the process.
When a user queries those documents, they’re presented with both the LLM’s responses and links to the source documents, so they can fact-check and read more if they want to.
Why are you doing it this way?
LLMs are still relatively new, certainly in the enterprise. They’re exciting, and brilliant, but they also go off-the-rails occasionally. We’re still defining the best practices for using them.
While we might try to race ahead by applying them everywhere we can, we need to ensure we’re using them to their advantage rather than adding new problems into the mix.
We get there by asking ‘why are we doing it this way?’
If you think like that, you can see problems in your process, such as how data is chunked before it goes into a knowledge base. A simple fix can work wonders!
Thanks to Smartbots’ Jaya Prakash Kommu for sharing this. You can watch his VUX World interview here.