One of the best articles I have read and so well written. Please continue writing more 🙏
Thanks for the kind words!
Please also, if possible, write about the query, key, and value tensors and the role each of them plays in deciding the next token.
Thanks for the suggestion! I think this may already be covered in my previous attention mechanism article: https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention ?
Very well written, clear, sharp, entertaining to read, and very educational. Thank you!
Thanks!
Great to see you back in action!! I hope your recovery is proceeding well.
Thank you, Sebastian. Quick question: the sliding window would mean that, alongside the cache, the context window is also truncated, right?
Yes, it would truncate the context.
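For anyone wondering what that looks like in code, here is a minimal sketch of truncating the cache to a sliding window (the `window_size` parameter and cache shapes are illustrative, not the article's exact implementation):

```python
import torch

def truncate_kv_cache(cache_k, cache_v, window_size):
    # Cache tensors of shape (batch, num_heads, seq_len, head_dim).
    # Keep only the most recent `window_size` positions; older tokens are
    # dropped from the cache and can no longer be attended to.
    if cache_k.size(2) > window_size:
        cache_k = cache_k[:, :, -window_size:, :]
        cache_v = cache_v[:, :, -window_size:, :]
    return cache_k, cache_v
```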
Thx a lot!
Thank you, great tutorial! One question I had about KV caching is why it is not also applied to the query data. It seems to me that if x is the full context in the vanilla implementation, then queries = self.W_query(x) also recomputes many query tokens repeatedly. I must be missing something, but previously couldn't find a clear answer.
That's because for the current query, you don't need past queries, only past keys and values when computing the attention score.
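To make that concrete, here is a rough sketch of a single decoding step (tensor shapes and names are purely illustrative): only the newest token's query is used, while the keys and values of all earlier tokens come from the cache.

```python
import torch

d = 64                                   # head dimension (illustrative)
q_new = torch.randn(1, d)                # query of the current token only
k_new = torch.randn(1, d)                # its key and value are computed once ...
v_new = torch.randn(1, d)

K_cache = torch.randn(10, d)             # keys of the 10 previous tokens
V_cache = torch.randn(10, d)             # values of the 10 previous tokens

K = torch.cat([K_cache, k_new])          # ... and appended to the cache
V = torch.cat([V_cache, v_new])

scores = (q_new @ K.T) / d**0.5          # shape (1, 11)
weights = torch.softmax(scores, dim=-1)
context = weights @ V                    # shape (1, d)
# Past queries never appear in this computation, so there is nothing to cache.
```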
One of the best implementations and explanations of the KV cache.
Thanks!
Thank you. I have a keen interest in this - especially on apple. I am beginning to learn about the constrained generation used with their on-device foundation model. This might be good content for you as well.
Interesting! I heard they announced API access to their on-device models in iOS 26 at their recent WWDC conference. Do you know if they said something about macOS 26 and opening the API up there as well?
Thanks for sharing! I have two questions:
1) For models running on CUDA devices, does this KV cache technique still apply? If so, does the cache use GPU or CPU memory?
2) Does the KV cache work across different inference sessions, e.g., so that the system prompt can reuse the cache?
Yes, this works for CUDA as well. You can initialize the placeholder tensors directly on the CUDA device for optimal efficiency. For this small demo there wasn't much of a speed benefit, but I'll add the KV cache to my Llama 3 and Qwen3 from-scratch models this weekend, so the difference will be more visible.
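For reference, preallocating the cache directly on the GPU could look roughly like this (the shapes and names below are illustrative placeholders, not the exact code from the post); the cache then lives in GPU memory:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_size, num_heads, max_seq_len, head_dim = 1, 12, 1024, 64

# Allocate the full-size cache once on the GPU; each generation step then
# writes the new token's keys/values into the next free slot instead of
# re-allocating (and re-transferring) tensors at every step.
cache_k = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
cache_v = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
```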
Great article as always, Sebastian, clearly spelling out the KV cache benefits. To illustrate this further, here is a compute-analysis spreadsheet that shows the specific KV cache totals and the compute reduction due to the KV cache for DeepSeek, Llama, and Qwen (essentially we get GEMVs instead of GEMMs in the attention and MLP blocks). https://www.linkedin.com/posts/prasadraje_i-am-making-available-for-free-this-llm-compute-activity-7326488598705242112-LlwX
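To put the GEMV-vs-GEMM point in shape terms, a rough sketch (sizes made up):

```python
import torch

d_model, T = 1024, 512
W = torch.randn(d_model, d_model)

# Without a KV cache: every step re-projects all T tokens -> matrix-matrix (GEMM)
x_full = torch.randn(T, d_model)
out_full = x_full @ W          # (T, d_model) @ (d_model, d_model)

# With a KV cache: only the newest token is projected -> matrix-vector (GEMV)
x_new = torch.randn(1, d_model)
out_new = x_new @ W            # (1, d_model) @ (d_model, d_model)
```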
Wow, this is exactly what I needed! Thanks Sebastian!
Love this! Thank you for sharing. I hope your recovery is swift.
Great tutorial! I have a question about how ambiguous words like "duck" are treated in this case when they appear multiple times with different meanings.
For example, in the sentence:
"He had to duck when the duck flew at him."
The first "duck" is a verb, and the second is a noun. Since we make use of cached key-value (KV) pairs for previously seen tokens, what happens in this case?
If we cache the first "duck" (the verb), do we simply reuse its KV pair when we encounter the second "duck"? Shouldn't their representations be different due to their distinct roles and contexts?
Good question, and I see what you mean. However, this is not an issue, since we retrieve the previous keys and values by position.
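In other words, the cache is indexed by position rather than by token id, so the second "duck" gets its own freshly computed key/value appended. A toy sketch (the hidden states and projection matrices below are just placeholders):

```python
import torch

d = 8
W_key = torch.randn(d, d)
W_value = torch.randn(d, d)

cache_k, cache_v = [], []

# The hidden states for the two "duck" tokens differ, because each has already
# been contextualized by the earlier layers / causal attention over its prefix.
h_duck_verb = torch.randn(d)    # position 3 in "He had to duck ..."
h_duck_noun = torch.randn(d)    # position 7 in "... when the duck flew at him."

for h in (h_duck_verb, h_duck_noun):
    cache_k.append(h @ W_key)    # a new entry is appended per position;
    cache_v.append(h @ W_value)  # nothing is looked up by token id

# cache_k[0] != cache_k[1]: the verb and the noun have distinct cached K/V entries.
```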
Please, I want to become a paid subscriber so I can see your subscriber-only content. Thank you!