Brief Summary
This video explains the concept of context windows in Large Language Models (LLMs) and their impact on performance. It covers how context windows affect an LLM's ability to remember and process information, the trade-offs between context window size and computational resources, and techniques to optimise context window usage for local models. The video also touches on the security implications of large context windows.
- LLMs have limited short-term memory, defined by their context window.
- Larger context windows require more computational power and VRAM.
- Techniques like flash attention and data compression can help optimise context window usage.
Intro - What is a Context Window?
LLMs, like humans, have limited short-term memory, referred to as the context window. This window determines how much information the LLM can hold in mind during a conversation. As a conversation grows longer, the LLM may start to forget earlier details, hallucinate, or slow down as it presses against the limits of its context window.
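As a rough sketch of why older details fall out first (no particular runtime works exactly like this, and the token counting below is deliberately crude), a chat client can only feed the model the most recent messages that fit inside the token budget:

```python
# Minimal sketch of why old messages "fall out" of a context window.
# Token counting is approximated by whitespace splitting; real models
# use a proper tokeniser.

CONTEXT_WINDOW = 2048  # max tokens the model can attend to at once

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokeniser."""
    return len(text.split())

def fit_to_window(messages: list[str], budget: int = CONTEXT_WINDOW) -> list[str]:
    """Keep the most recent messages that fit inside the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk backwards from the newest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                           # anything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["My dog's name is Biscuit."] + ["Filler sentence."] * 2000
print(fit_to_window(history)[0])  # the earliest fact is no longer included
```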
LM Studio Demo: How Much Can AI Remember?
The presenter uses LM Studio to demonstrate the context window. Tokens are the units AI models use to measure text; a sentence is broken down into tokens before the model processes it. The Gemma 3 4B model is loaded with a context window of 2048 tokens, meaning it can only pay attention to that many tokens at once. By telling the model a fact and then exceeding its context window with additional information, the presenter shows how the model forgets the initial fact. Increasing the context window allows the model to remember the initial information. System prompts, documents, and code also consume tokens, filling up the context window.
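Gemma ships its own tokeniser, but any tokeniser illustrates the point. A quick sketch using the tiktoken library (its cl100k_base encoding, chosen only because it is easy to install, not because Gemma uses it):

```python
# Sketch: counting tokens with a real tokeniser (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentence = "My dog's name is Biscuit and he loves peanut butter."
tokens = enc.encode(sentence)

print(len(tokens), "tokens")               # a short sentence is roughly a dozen tokens
print([enc.decode([t]) for t in tokens])   # the individual token strings

# A 2048-token window also has to hold the system prompt, any pasted
# documents or code, and the model's own replies, not just your messages.
```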
Local Models vs. Cloud Models
While increasing the context window seems like a solution, there are limits. Local models are constrained by the available VRAM (video RAM) on the GPU. Attempting to load a model with an excessively large context window can max out the VRAM, causing performance issues. Cloud models, such as GPT-4o, Claude 3, and Gemini 2.5, offer larger context windows without the hardware limitations of local models.
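As a back-of-the-envelope sketch of why this happens, the attention (KV) cache grows linearly with the context length; the model dimensions below are assumed for illustration, roughly the shape of a small 4B-class model, not figures from the video:

```python
# Rough KV-cache memory estimate: 2 (K and V) * layers * kv_heads *
# head_dim * context_length * bytes_per_value. The model shape here is
# hypothetical, purely to show the scaling.
def kv_cache_gib(n_ctx: int,
                 n_layers: int = 34,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_value: int = 2) -> float:   # 2 bytes = fp16
    total = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return total / 1024**3

for n_ctx in (2_048, 8_192, 32_768, 131_072):
    print(f"{n_ctx:>7} tokens -> {kv_cache_gib(n_ctx):.2f} GiB of KV cache")
```

The cache comes on top of the model weights themselves, so pushing the context slider up can exhaust VRAM even when the model alone fits comfortably.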
Why LLMs are bad at paying attention
Even with large context windows, LLMs can struggle to maintain attention throughout a conversation. A study showed that LLMs tend to be most accurate with information at the beginning and end of a conversation, while accuracy drops in the middle, forming a U-shaped curve. LLMs use attention mechanisms to assign relevance scores between tokens, and longer conversations require far more of these pairwise calculations, increasing computational demands and potentially leading to hallucinations and slower responses. When shifting topics, it's better to start a new chat to improve performance.
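A minimal numpy sketch of scaled dot-product attention, with toy sizes rather than anything model-specific, shows where the cost comes from: every token's query is scored against every other token's key, so the score matrix is n by n and the work grows quadratically with conversation length:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8                                # 6 tokens, 8-dim vectors (toy sizes)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(Q, K, V).shape)            # (6, 8)
# Doubling n quadruples the size of the (n, n) score matrix.
```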
How to Max Out your Context Windows
Several optimisations can help maximise context window usage for local models. Flash attention changes how the model computes attention, skipping the full table of token comparisons and processing tokens in chunks, improving memory use and speed. K-cache and V-cache quantisation compress the attention cache data so it takes up less room in VRAM. A paged cache moves the attention cache between the GPU (VRAM) and system RAM, sharing memory but potentially slowing down performance. While larger context windows are desirable, they require more memory and power, and they also increase the attack surface, making the LLM more vulnerable to malicious prompts.
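To make the "chunks" idea concrete, here is a toy online-softmax version of attention in the spirit of flash attention; the real implementation fuses this into GPU kernels, but the streaming maths is the same, and the full n by n score table never has to be held in memory at once:

```python
import numpy as np

def chunked_attention(Q, K, V, chunk=4):
    """Toy online-softmax attention: scan K/V in chunks instead of
    materialising the full (n, n) score matrix. Same maths as flash
    attention's streaming trick, minus the fused GPU kernel."""
    d = Q.shape[-1]
    out = np.zeros_like(Q)
    for i, q in enumerate(Q):                      # one query row at a time
        m, l, acc = -np.inf, 0.0, np.zeros(d)
        for start in range(0, K.shape[0], chunk):
            k_c, v_c = K[start:start + chunk], V[start:start + chunk]
            s = q @ k_c.T / np.sqrt(d)             # scores for this chunk only
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)              # rescale the running sums
            p = np.exp(s - m_new)
            l = l * scale + p.sum()
            acc = acc * scale + p @ v_c
            m = m_new
        out[i] = acc / l
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
full = np.exp(Q @ K.T / np.sqrt(8))
full = (full / full.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(chunked_attention(Q, K, V), full))   # True: same result, less peak memory
```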