Release - 2023-05-15
Introducing Motorhead, an open-source LLM memory server.
Have you ever had a conversation with someone who completely forgot what you said just moments ago? Even worse, they stare at you like you've just met for the first time. Pretty frustrating (and also rude), isn't it?
Well, without the right tooling, LLM-powered chat conversations would look like this every single time. This is because LLMs are stateless by default: they treat each incoming user input independently. This not only makes for some pretty incoherent conversations, but it dilutes the magic that this technology is capable of.
Context is king
Developers building LLM-powered chat applications with models like GPT-3.5 or GPT-4 have to deal with several technical limitations. Not least of all is managing the context, or state, of a conversation given a user input.
To solve this problem, you can use tools that provide memory to chat applications. Memory comes in two forms: short term and long term.
Short term memory refers to anything that is within the current context window of a chat. It’s the reason we can have conversations like this:
When the second question is asked, the previous question and answer are also passed to the LLM via the prompt, providing context to the LLM and allowing for a coherent answer. But the prompt will grow with each subsequent input, eventually reaching the prompt size limit (more on this below). This is where long term memory comes into play.
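To make the idea concrete, here is a minimal sketch of how short-term memory works: the running message list is simply re-sent with every request. The function and variable names are illustrative, not part of any particular API.

```python
# Sketch: short-term memory is just the running message list that gets
# re-sent with every request, so the model can answer follow-ups coherently.

def build_prompt(history, user_input):
    """Combine prior turns with the new input so the model sees context."""
    messages = list(history)                      # everything said so far
    messages.append({"role": "user", "content": user_input})
    return messages

history = [
    {"role": "user", "content": "Who wrote The Hobbit?"},
    {"role": "assistant", "content": "J.R.R. Tolkien."},
]

# The follow-up only makes sense because the prior Q&A rides along.
prompt = build_prompt(history, "When was he born?")
print(len(prompt))  # 3 messages: the prompt grows with every turn
```

Note that `history` grows by two entries per exchange, which is exactly why the prompt eventually hits the token limit.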
Long term memory refers to anything that is outside the context of an existing conversation. This not only helps maintain the context length of a conversation, but it allows you to pull in information from data sources connected to the application. This is necessary for applications that contain a significant amount of proprietary information or support long ongoing chat conversations.
What is Motorhead?
Motorhead is a tool that provides both short and long term memory for your applications. It helps developers manage the current context of a conversation through incremental summarization, and it incorporates information outside of the context window through long term memory.
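As a rough sketch, a client talks to Motorhead by reading and writing messages for a named session over HTTP. The endpoint path, port, and payload shape below are assumptions for illustration, not a documented contract; check the Motorhead README for the real API.

```python
import json

# Hypothetical client sketch for a Motorhead-style memory server. The URL
# and JSON body shown here are assumptions, not Motorhead's documented API.

BASE_URL = "http://localhost:8080"  # assumed default address

def memory_request(session_id, messages):
    """Build the URL and JSON body for appending messages to a session."""
    url = f"{BASE_URL}/sessions/{session_id}/memory"
    body = json.dumps({"messages": messages})
    return url, body

url, body = memory_request("user-42", [{"role": "Human", "content": "Hi there"}])
print(url)  # http://localhost:8080/sessions/user-42/memory
```

The key point is that memory is keyed by session, so each conversation keeps its own context independently of the others.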
Prompt sizes are limited by tokens, which are small chunks of text (roughly a few characters each), with each model imposing a maximum you can pass in. For example, the token limit for GPT-3.5 is 4,096. With incremental summarization, we can compress the chat history (and therefore the tokens) and send it back to the LLM, ensuring the conversation runs smoothly without breaking the token limit.
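A quick sketch of the budget problem: before sending a prompt, you need to know whether the accumulated messages still fit under the model's limit. The 4-characters-per-token estimate below is a rough heuristic for illustration, not a real tokenizer.

```python
# Sketch: checking whether a chat history fits under a token budget.
# The 4-chars-per-token estimate is a crude stand-in for a real tokenizer.

TOKEN_LIMIT = 4096  # GPT-3.5's context window

def estimate_tokens(text):
    """Very rough token estimate: about one token per four characters."""
    return max(1, len(text) // 4)

def fits_in_window(messages, limit=TOKEN_LIMIT):
    """True if the combined messages stay under the token limit."""
    return sum(estimate_tokens(m["content"]) for m in messages) <= limit
```

Once `fits_in_window` starts returning False, something has to give: either old messages get dropped, or, as Motorhead does, they get folded into a summary.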
As messages get pushed for a specific session, Motorhead keeps track of the number of messages. Once a predetermined threshold is reached, a background process is initiated to summarize the messages. In order to consistently adhere to the token limit, Motorhead selects all the messages that can fit within the given window, combines them with the existing summary, and then generates a fresh summary. This process is repeated until the number of stored messages is reduced down to a configured value.
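The loop described above can be sketched as follows. The `summarize` stub stands in for the actual LLM call, and the threshold values are illustrative, not Motorhead's defaults.

```python
# Sketch of the incremental-summarization loop: fold batches of older
# messages into the running summary until only `keep` raw messages remain.

KEEP_MESSAGES = 4  # illustrative stand-in for the configured retention value

def summarize(existing_summary, messages):
    """Placeholder for the LLM call that folds messages into the summary."""
    joined = " ".join(m["content"] for m in messages)
    return (existing_summary + " | " + joined).strip(" |")

def compact(summary, messages, keep=KEEP_MESSAGES, batch_size=3):
    """Repeatedly summarize the oldest messages until only `keep` remain."""
    while len(messages) > keep:
        take = min(batch_size, len(messages) - keep)  # never eat the tail
        summary = summarize(summary, messages[:take])
        messages = messages[take:]
    return summary, messages
```

Because summarization happens in a background process, the chat itself never blocks on the LLM call that compresses its history.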
Long term memory, on the other hand, is how we pull in context outside of the current chat. We use Redis’ vector similarity search (VSS) to power long term memory, allowing you to search on top of data connected to your application.
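Conceptually, vector similarity search embeds each memory as a vector and retrieves the ones closest to the query. In Motorhead this runs on Redis VSS over real embeddings; the toy three-dimensional vectors below just show the idea.

```python
import math

# Conceptual sketch of vector similarity search. Motorhead delegates this
# to Redis VSS over real embeddings; the toy vectors here are illustrative.

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

store = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of "how do refunds work?"
best = max(store, key=lambda k: cosine(store[k], query))
print(best)  # refund policy
```

The same mechanism lets you search over any data connected to your application, not just past chat messages.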
long term vs short term memory architecture
We’re using Redis because of how versatile and reliable it is. It provides the data structures for storing message lists, easy methods for fetching slices, as well as vector similarity search for information retrieval. All of these operations are blazingly fast™ with Redis and Rust.
Motorhead in action
Motorhead can be hosted or self-hosted
You can self-host Motorhead and keep message data on your own infrastructure, or host your data with Metal. Right now, the hosted plan supports 1,000 messages per month, but if you need a higher volume, please feel free to contact us for an increase!