Learning - 2023-11-01
Deep Dive: Memory for LLM applications
by Sergio Prada

Introduction
Language models have revolutionized the way we interact with machines, allowing for more fluid and natural conversations. However, building effective applications that leverage these models presents several challenges, particularly around how memory is managed. This article delves into the intricacies of memory in large language model (LLM) applications and explores strategies developers can adopt to manage it effectively.
Defining Memory in the World of LLMs
In the realm of LLMs, it's crucial to understand the distinction between stateful and stateless operations. A stateless operation doesn't retain any information from one request to the next; each request is handled independently. In contrast, a stateful operation remembers information across requests, providing continuity and context.
LLMs, by their inherent design, are stateless. This means that they don't retain any context from previous interactions. However, interfaces like ChatGPT introduce a state to these models by consistently passing the entire message history or context every time a prompt is given. This method ensures the model is aware of the conversation's history, enabling more contextual and relevant responses.
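To make that concrete, here is a minimal sketch of the pattern chat interfaces use: the client keeps the history and re-sends all of it with every request. The `call_llm` function is a hypothetical stand-in for whatever chat-completion client you use, not a real API.

```python
def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion call (e.g. an OpenAI-style client)."""
    raise NotImplementedError

# The "state" lives entirely on the application side.
history: list[dict] = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def send(user_message: str) -> str:
    # Append the new user turn, then pass the ENTIRE history to the stateless model.
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)
    # Store the model's reply so the next call still sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply
```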
Memory, in this context, refers to the state that is shared across LLM calls. The management of this memory is pivotal for the performance and scalability of LLM-based applications. Continuously passing an ever-longer conversation history to an LLM not only raises token costs but also tends to make the model's responses less focused and less relevant as the conversation grows.
Different applications have adopted varied strategies to manage this challenge. As previously mentioned, ChatGPT introduces statefulness by passing the entire message history with each prompt (and potentially using other techniques unknown to users).
Other applications might truncate or selectively pass only relevant portions of the conversation, utilize external state management systems, or even combine multiple techniques to strike a balance between performance and context preservation. As LLMs continue to find their way into more applications and platforms, efficient memory management becomes an indispensable consideration for developers.
Memory as a Precious Resource
When we talk about memory in LLMs, the primary resource in play is the context window size. It determines the amount of previous conversation or context the model can "remember" at any given time. As developers, we need to treat this context window size as a precious resource. Not only is it limited, but deciding what information should be within this window is arguably one of the most challenging aspects of LLM application development.
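Because that budget is measured in tokens, it helps to measure what the running conversation actually costs before deciding what else can share the window. A small sketch, assuming the tiktoken library is available and that the cl100k_base encoding matches your model (both are assumptions to adjust):

```python
import tiktoken

# cl100k_base matches several recent OpenAI models; pick the encoding for your model.
enc = tiktoken.get_encoding("cl100k_base")

def history_tokens(history: list[dict]) -> int:
    """Rough token cost of a message history (ignores per-message formatting overhead)."""
    return sum(len(enc.encode(m["content"])) for m in history)

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize our onboarding docs."},
]

CONTEXT_WINDOW = 8192  # illustrative size; check your model's actual limit
print(f"{history_tokens(history)} tokens used, "
      f"{CONTEXT_WINDOW - history_tokens(history)} left for new turns and knowledge")
```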
The Myth of Larger Context Window Size
It's a common misconception that a larger context window size automatically yields better results from the LLM. However, research has shown otherwise. The paper "Lost in the Middle: How Language Models Use Long Contexts" by F. Liu et al. (2023) revealed that merely increasing the context window size doesn't guarantee improved performance or more relevant outputs from the model.

Adapted from "Lost in the Middle: How Language Models Use Long Contexts" by F. Liu et al., 2023 Figure 6
Another interesting observation from this paper is that the position of the relevant information in the context affects the accuracy of the response. The authors found that models draw on information at the beginning and end of the context far more reliably than information buried in the middle. For developers, this means strategically placing the most crucial details or cues either at the beginning or closer to the end of the context window to maximize their influence on the model's output.

Adapted from "Lost in the Middle: How Language Models Use Long Contexts" by F. Liu et al., 2023 Figure 1
Additionally, while the temptation might be to cram as much information as possible into the context, it's essential to ensure clarity and relevance. Overloading the model with unnecessary details might dilute the importance of key points, leading to less accurate or relevant responses.
Moreover, the researchers pointed out strategies that can be utilized to improve the performance of LLMs with long contexts. They suggested, '...effective reranking of retrieved documents (pushing relevant information closer to the start of the input context) or ranked list truncation (returning fewer documents when necessary; Arampatzis et al., 2009).' This underlines the importance of not only selecting the right information but also presenting it in an order that optimizes model performance.
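As a rough illustration of that idea, the sketch below orders retrieved documents by a relevance score so the strongest matches sit at the start of the context, and stops adding documents once a token budget is spent (ranked list truncation). The scores and the token estimate are stand-ins; in practice they would come from your retriever or reranker and a real tokenizer.

```python
def rerank_and_truncate(
    scored_docs: list[tuple[float, str]],  # (relevance score, document text) from your retriever
    token_budget: int,
    count_tokens=lambda s: len(s) // 4,    # crude token estimate; swap in a real tokenizer
) -> list[str]:
    # Push the most relevant documents toward the start of the context...
    ranked = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
    # ...and return fewer documents when the budget runs out.
    kept, used = [], 0
    for score, doc in ranked:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break
        kept.append(doc)
        used += cost
    return kept
```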
Techniques to Optimize Memory Usage
Given the constraints and challenges with memory, developers have come up with innovative techniques to optimize its usage:
Rolling Window
One such technique is the Rolling Window. It operates by preserving a fixed-size window of the most recent interactions or messages. As new content is introduced, the older information naturally phases out of the window, ensuring that the model is always equipped with the latest context. This approach is both intuitive and straightforward to implement, making it an economical choice as it avoids extra LLM invocations. However, its primary drawback is the potential loss of any information that exists outside of this window.

Rolling window
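A minimal sketch of a rolling window over chat messages, here keeping the last N turns with a deque (a token-based window works the same way, trimming the oldest messages until the history fits the budget):

```python
from collections import deque

class RollingWindowMemory:
    """Keeps only the most recent messages; older ones fall out automatically."""

    def __init__(self, max_messages: int = 20):
        self.window: deque[dict] = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.window.append({"role": role, "content": content})

    def context(self, system_prompt: str) -> list[dict]:
        # The system prompt is pinned; only the conversation itself rolls.
        return [{"role": "system", "content": system_prompt}, *self.window]

memory = RollingWindowMemory(max_messages=6)
memory.add("user", "Hi, I'd like help planning a trip.")
memory.add("assistant", "Sure! Where would you like to go?")
# Once more than 6 messages accumulate, the oldest ones are silently dropped.
```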
Incremental Summary
Another approach is the Incremental Summary. Rather than feeding the entire conversation to the model, this method extracts and processes the core essence of the dialogue, providing the model with a summarized version. The main advantage of this technique is its ability to reduce explicit data loss, and it even allows developers to guide the model with specific prompts. On the flip side, summarization can be a lossy process, often missing out on nuances. Moreover, generating these summaries typically requires an additional LLM call, leading to extra costs.
This is what we use in Motorhead, an open-source LLM memory server that Metal developed early on. It provides simple APIs to manage multiple sessions and scale chat applications.

Incremental/Recursive Summarization
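A sketch of incremental (recursive) summarization: each new exchange is folded into a running summary by an extra LLM call, and only that summary is sent as context instead of the full transcript. `call_llm` and the prompt wording are hypothetical stand-ins, not Motorhead's actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real completion call."""
    raise NotImplementedError

SUMMARIZE_PROMPT = """Progressively summarize the conversation.
Current summary:
{summary}

New lines of conversation:
{new_lines}

Return an updated summary that preserves names, decisions, and open questions."""

class IncrementalSummaryMemory:
    def __init__(self):
        self.summary = ""

    def update(self, new_messages: list[dict]) -> None:
        # One extra LLM call per update: fold the new turns into the running summary.
        new_lines = "\n".join(f'{m["role"]}: {m["content"]}' for m in new_messages)
        self.summary = call_llm(
            SUMMARIZE_PROMPT.format(summary=self.summary, new_lines=new_lines)
        )

    def context(self) -> str:
        # The model sees a compact summary instead of the full transcript.
        return f"Summary of the conversation so far:\n{self.summary}"
```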
Custom State Management
Lastly, there's the Custom State Management technique. The foundational idea of this approach is fostering an inherent "awareness" within the LLM about its own memory constraints, underlining the fact that memory resources are finite. Using this concept, a system can be architected where the LLM is used as a decision maker, determining when to retrieve or store pertinent memories.
For instance, consider a setup where a "reflection call" is made to the LLM before it crafts a response. This introspective step checks the current state and its dimensions, thereby deciding if there's a need to either store/replace new memories or fetch existing ones from an external repository.
Several methods have been proposed to put this technique into practice, with implementations like MemGPT and Self-RAG adding to the growing body of knowledge. The appeal of this approach is its potential to deliver a highly tailored user experience for individual use cases. Yet it doesn't come without its intricacies: implementation can be multifaceted, and the repeated calls to the LLM can increase operational costs.

Custom State Management
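A sketch of the reflection-call pattern described above: before answering, the application asks the model whether anything in the new turn should be stored, or whether stored memories should be fetched, and then acts on that decision. `call_llm`, `store_memory`, and `search_memories` are hypothetical stand-ins; MemGPT and Self-RAG each define their own, more elaborate protocols.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real completion call."""
    raise NotImplementedError

def store_memory(text: str) -> None:           # stand-in for writing to external storage
    ...

def search_memories(query: str) -> list[str]:  # stand-in for reading from external storage
    return []

REFLECT_PROMPT = """You manage a limited working memory for an assistant.
Given the user's new message, answer in JSON with:
  "store": a fact worth remembering long-term, or null
  "retrieve": a search query for relevant stored memories, or null

User message: {message}"""

def respond(user_message: str) -> str:
    # 1. Reflection call: let the model decide what to store or fetch.
    decision = json.loads(call_llm(REFLECT_PROMPT.format(message=user_message)))
    if decision.get("store"):
        store_memory(decision["store"])
    recalled = search_memories(decision["retrieve"]) if decision.get("retrieve") else []

    # 2. Answer call: respond with any recalled memories added to the context.
    context = "\n".join(recalled)
    return call_llm(f"Relevant memories:\n{context}\n\nUser: {user_message}\nAssistant:")
```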
What about knowledge?
Grounding an LLM application – making it knowledgeable and relevant to specific domains – is often indispensable. This 'knowledge' is typically infused into the model by being part of the context window or prompt. However, the introduction of domain or application-specific knowledge can complicate memory management. With a fixed context window size, developers are faced with the dilemma of deciding what takes precedence: conversational history or domain-specific knowledge.
For instance, in a medical chatbot, while the recent conversation history is essential to maintain context, domain knowledge about medical terms, symptoms, and treatments is equally crucial. Neglecting one for the other could lead to either a loss of conversational context or a decrease in the application's domain expertise.
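One pragmatic way to handle this trade-off is to give each component an explicit share of the token budget, so neither conversation history nor domain knowledge can silently crowd out the other. A minimal sketch, with an intentionally crude token estimate (swap in a real tokenizer) and a fixed split chosen purely as an assumption, not a recommendation:

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic; use a real tokenizer in practice

def build_context(
    history: list[str],           # most recent messages last
    knowledge: list[str],         # retrieved domain snippets, most relevant first
    window: int = 8000,           # illustrative context window
    knowledge_share: float = 0.4  # assumed split; tune per application
) -> list[str]:
    knowledge_budget = int(window * knowledge_share)
    history_budget = window - knowledge_budget

    kept_knowledge, used = [], 0
    for snippet in knowledge:
        cost = estimate_tokens(snippet)
        if used + cost > knowledge_budget:
            break
        kept_knowledge.append(snippet)
        used += cost

    kept_history, used = [], 0
    for message in reversed(history):  # keep the newest turns first
        cost = estimate_tokens(message)
        if used + cost > history_budget:
            break
        kept_history.append(message)
        used += cost

    return kept_knowledge + list(reversed(kept_history))
```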
Additionally, as LLMs and their associated techniques become more sophisticated, the line between 'stored memory' and 'dynamic retrieval' will likely blur, offering even more refined ways to manage and utilize knowledge in these systems.
Conclusion
Memory management is a cornerstone of LLM application development. By understanding its intricacies and leveraging innovative techniques, developers can ensure their applications are not only efficient but also deliver optimal performance. As the field of LLMs continues to evolve, so will the strategies and tools to harness their full potential.