Learning - 2023-05-18
Learn the ins and outs of chunking for embedding generation.
Painting of chunky cats playing drums at a metal concert
Imagine you’re ordering pasta at a restaurant and you want to know how the noodles are prepared. You ask your waiter who, after thinking for a minute, responds with “in the kitchen”. While factually true, this is not a helpful response. Hoping for a better answer, you flag down another waiter and pose the same question. This time, the server launches into an extensive explanation about the flour’s origin, the egg collection process, and the delivery of these ingredients to the restaurant. While informative, this answer is still not what you were looking for, and, even worse, your appetite for pasta has come and gone. While each waiter’s response was technically correct, both did a poor job of answering your question with the appropriate amount of relevant detail.
If you’re building an LLM-based application with embeddings, you want to avoid these kinds of scenarios. Your goal is to build an application that responds to user queries with the right amount of information, given the context. Chunking is one process that helps you do this.
So what’s the deal with chunking anyway?
Chunking is the process of breaking down large bodies of text into smaller, digestible pieces. These smaller pieces, or chunks, are then stored as embeddings in a vector database. Since embeddings are semantic representations of data, they enable semantic queries with natural language to search for and retrieve information.
The goal of chunking is to capture the semantic essence of a text segment in relation to its larger corpus while minimizing noise. You want to break down the text sufficiently without compromising the chunk’s meaning when compared to other chunks in the database. In general, if a chunk conveys meaning to a reader without needing the entire document, it will likely make sense to an LLM as well.
A Diagram articulating the chunking process
How you go about chunking your text data will have a big impact on the way users will be served information in your application.
As a rule of thumb, smaller chunks yield more specific query results, while larger chunks capture the broader context of the text and how sentences and phrases relate to each other.
With this in mind, consider how your users may query your application. Will they make short queries in search of concise answers? Will they ask longer questions involving related concepts? When asking these questions, the type of data you’re indexing can be a good place to start. If you’re indexing lengthy client proposals or government filings, larger chunks may be appropriate, whereas smaller chunks may be the way to go when indexing chat messages, for instance.
In practice, many indexes will contain both short and long chunks, and there’s no guarantee that users will only make a certain type of query. But chunk size is just one factor to consider, as different methods are available to break up your data.
Different chunking strategies
While there’s no silver bullet for chunking (that we know of), here are three methods commonly used today.
Sentence-based chunking is an approach that divides text into individual sentences using simple punctuation-based rules. Since each chunk is a sentence, it’s best to use this approach when the semantic meaning is captured at the sentence level. This is a good approach for short queries that target specific information.
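As a minimal sketch, sentence-based chunking can be implemented with a punctuation-based regular expression. Note that this simple heuristic mishandles abbreviations like “Dr.” or “e.g.”; production systems typically use a proper sentence tokenizer.

```python
import re

def sentence_chunks(text):
    """Split text into sentence chunks using simple punctuation rules."""
    # Split after ., !, or ? followed by whitespace -- a rough heuristic
    # that breaks on abbreviations such as "Dr." or "e.g."
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

text = "Chunking splits text. Each sentence becomes a chunk! Does it work?"
print(sentence_chunks(text))
# → ['Chunking splits text.', 'Each sentence becomes a chunk!', 'Does it work?']
```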
Fixed-length chunking is another approach that splits text by a predetermined number of tokens. This helps to ensure that chunks are under an LLM’s token limit, but it can also lead to sentences or phrases being divided in a way where meaning is lost. This method also doesn’t consider the flow of information in the larger body of text.
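A sketch of fixed-length chunking, using whitespace-separated words as a stand-in for tokenizer tokens (a real pipeline would count tokens with the embedding model’s own tokenizer):

```python
def fixed_length_chunks(text, max_tokens=256):
    """Split text into chunks of at most max_tokens tokens.

    Whitespace words stand in for tokenizer tokens here; swap in the
    model's tokenizer for accurate counts against its token limit.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# With max_tokens=4, a 10-word text yields chunks of 4, 4, and 2 words.
print(fixed_length_chunks("a b c d e f g h i j", max_tokens=4))
# → ['a b c d', 'e f g h', 'i j']
```

Notice how the last sentence of a document can get split mid-phrase, which is exactly the loss of meaning described above.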
A sliding window, or “overlapping”, chunking approach uses fixed-length chunks, but includes an overlap between consecutive chunks. For example, you can split text into 256-token chunks with a 50% overlap between consecutive chunks. This helps maintain contextual continuity, but it may lead to redundant information in the index, increased storage requirements, and added query processing time.
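The sliding-window variant can be sketched the same way, again with whitespace words standing in for tokenizer tokens. Each chunk starts `window - overlap` tokens after the previous one, so consecutive chunks share `overlap` tokens:

```python
def sliding_window_chunks(text, window=256, overlap=128):
    """Fixed-length chunks where consecutive chunks share `overlap` tokens."""
    tokens = text.split()  # whitespace words stand in for tokenizer tokens
    step = window - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the window has reached the end of the text
    return chunks

# window=4 with overlap=2 is the 50% overlap described above.
print(sliding_window_chunks("a b c d e f g h i j", window=4, overlap=2))
# → ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

The repeated tokens in adjacent chunks are the redundancy trade-off: the index grows, but a query matching a phrase near a chunk boundary can still retrieve it with surrounding context.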
Testing your chunking strategy in Metal
It’s pretty straightforward to test potential user queries in Metal. Simply view the Browse tab in an index and enter a query your users could make.
Metal's UI that displays chunked embeddings.
If your query results are starting to sound like the waiters mentioned at the beginning of this article, it may be time to try a new chunking strategy.
We will be announcing more features to help improve chunking soon, accommodating different data types and use cases. If you have a particular chunking strategy you’ve seen work well, feel free to drop us a line on Discord or get in touch.