35 Comments
Michael Xie:

Very well written, clear, sharp, entertaining to read, and very educational. Thank you!

Sebastian Raschka, PhD:

Thanks!

Vivek Nayyar:

One of the best articles I have read and so well written. Please continue writing more 🙏

Sebastian Raschka, PhD:

Thanks for the kind words!

Vivek Nayyar:

Please also, if possible, write about the query, key, and value tensors and the role each of them plays in deciding the next token.

Sebastian Raschka, PhD:

Thanks for the suggestion! I think this may already be covered in my previous attention mechanism article: https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention ?

RR:

Great to see you back in action!! I hope your recovery is proceeding well.

Aldred:

I tried to load the GPT-2 weights, but I needed to comment out "persistent=False" for the mask, otherwise the weights could not be loaded. What is the rationale for that setting?

self.register_buffer(
    "mask",
    torch.triu(torch.ones(context_length, context_length), diagonal=1),
    # persistent=False
)

Sebastian Raschka, PhD:

Good point. This flag determines whether these tensors are included when saving or loading the weights via the `state_dict`. I think you are loading the weights from a previous checkpoint / the book? The discrepancy would be because I didn't set this to `False` in the book. In hindsight, I think setting it to `False` is better though, because these are values that are easily recreated and don't need to be saved in the weights file.

I should probably set it to `True` here to avoid this issue though.
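For illustration, here is a minimal sketch (a toy module, not the book's code) of how the `persistent` flag of `register_buffer` affects what ends up in the `state_dict`, which is why a checkpoint saved with the default setting and a model using `persistent=False` disagree:

import torch
import torch.nn as nn

class CausalMask(nn.Module):
    def __init__(self, context_length=4, persistent=True):
        super().__init__()
        # Non-persistent buffers are excluded from the state_dict, so a
        # checkpoint saved with the default (persistent=True) contains a
        # "mask" entry that a persistent=False module does not expect.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1),
            persistent=persistent,
        )

print(list(CausalMask(persistent=True).state_dict().keys()))   # ['mask']
print(list(CausalMask(persistent=False).state_dict().keys()))  # []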

Sebastian Raschka, PhD:

Thanks for the note. I think it's easiest to remove the `persistent=False` so the weights can be loaded without issues. (I just updated the article.)

active_sky:

The Mac is a magical piece of hardware; the KV-cache speedup on the CPU is even greater than on the GPU.

Sebastian Raschka, PhD:

Ha yes, but I think that's mainly due to the small size of the model.

Abhishek Sharma:

I have been following your resources for almost a decade now, and you have single-handedly helped me become a better engineer by leaps and bounds.

Sebastian Raschka, PhD:

Thanks for the kind comment! On the one hand it makes me feel a bit old, but on the other hand, this is very nice to hear!

Mariano Kamp:

Thank you, Sebastian. Quick question: the sliding window would mean that, alongside the cache, the context window is also truncated, right?

Sebastian Raschka, PhD:

Yes, it would truncate the context.
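To illustrate, a minimal sketch (assumed names and shapes, not the article's code) of a sliding-window KV cache: dropping the oldest cached entries is exactly what truncates the effective context.

import torch

window_size = 4                  # maximum number of cached positions
head_dim = 8
k_cache = torch.empty(0, head_dim)
v_cache = torch.empty(0, head_dim)

def append_to_cache(k_new, v_new):
    global k_cache, v_cache
    # Keep only the most recent `window_size` keys/values; older tokens can
    # no longer be attended to, i.e., the effective context is truncated.
    k_cache = torch.cat([k_cache, k_new])[-window_size:]
    v_cache = torch.cat([v_cache, v_new])[-window_size:]

for _ in range(6):               # simulate generating 6 tokens
    append_to_cache(torch.randn(1, head_dim), torch.randn(1, head_dim))

print(k_cache.shape)             # torch.Size([4, 8]) -- only the last 4 tokens remain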

kevin:

Thanks a lot!

Peter van Beek:

Thank you, great tutorial! One question I had about KV caching is why it is not also applied to the query data. It seems to me that if x is the full context in the vanilla implementation, then queries = self.W_query(x) also recomputes many query tokens repeatedly. I must be missing something, but I previously couldn't find a clear answer.

Sebastian Raschka, PhD:

That's because, for the current token, you only need its own query; only the past keys and values are reused when computing the attention scores, so there is nothing to gain from caching past queries.
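As a minimal sketch (toy shapes and identity weights, not the article's code): when generating the next token, only that token's query is needed, but it attends to the keys and values of all previous tokens, which is why only K and V are cached.

import torch

d = 8
k_cache = torch.randn(5, d)        # keys of the 5 tokens seen so far
v_cache = torch.randn(5, d)        # values of the 5 tokens seen so far

x_new = torch.randn(1, d)          # embedding of the newest token only
W_q = W_k = W_v = torch.eye(d)     # identity "weights" to keep the sketch simple

q_new = x_new @ W_q                            # one query, for the newest position
k_cache = torch.cat([k_cache, x_new @ W_k])    # append the new key
v_cache = torch.cat([v_cache, x_new @ W_v])    # append the new value

# The newest query attends to all cached keys/values; past queries never reappear.
attn_weights = torch.softmax(q_new @ k_cache.T / d**0.5, dim=-1)
context = attn_weights @ v_cache
print(context.shape)               # torch.Size([1, 8])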

Aldred:

Is it possible to modify the vanilla implementation so that it only takes in the current query, instead of recomputing queries for the whole context (i.e., queries = self.W_query(x), where x is the full context rather than only the latest token)?

Sebastian Raschka, PhD:

Yes, definitely. But this is taken care of in `generate_text_simple_cached`. I.e., after the initial prefill, it only passes the most recent token, not all tokens, so `self.W_query(x)` doesn't get the full context.
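For reference, a rough sketch of such a cached generation loop (the names `reset_kv_cache` and `use_cache` are illustrative assumptions, not necessarily the article's exact API): the full prompt is passed once during prefill, then only the newest token on each step.

import torch

def generate_cached(model, token_ids, max_new_tokens):
    model.reset_kv_cache()                        # assumed helper to clear the cache
    logits = model(token_ids, use_cache=True)     # prefill: pass the whole prompt once
    for _ in range(max_new_tokens):
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
        logits = model(next_id, use_cache=True)   # decode: pass only the newest token
    return token_ids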

Aldred:

Got it. To make sure I truly understand what you mean by "you don't need past queries", I modified the vanilla implementation so that it only takes the current query into account, but without a KV cache, as follows. It's hacky and highly inefficient, but it's meant for learning purposes only. I tested it, and it works exactly like the vanilla implementation. Let me know what you think.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )
        self.ptr_current_pos = 0

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)

        # Calculate Q for all of the tokens on the first invocation, else only for the last token
        if self.ptr_current_pos == 0:
            queries = self.W_query(x)
        else:
            queries = self.W_query(x[:, -1, :])
            queries = queries.unsqueeze(dim=-2)
        num_tokens_q = queries.shape[1]  # num_tokens on the first call, 1 afterwards

        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens_q, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Mask on the first invocation only; afterwards the single new query may attend to all tokens
        if self.ptr_current_pos == 0:
            # Original mask truncated to the number of tokens and converted to boolean
            mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
            # Use the mask to fill attention scores
            attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens_q, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens_q, self.d_out)
        context_vec = self.out_proj(context_vec)  # optional projection

        self.ptr_current_pos += num_tokens_q
        return context_vec

    def reset(self):
        self.ptr_current_pos = 0

Sahibpreet Singh:

One of the best implementations and explanations of the KV cache.

Sebastian Raschka, PhD:

Thanks!

Scott Gwynn:

Thank you. I have a keen interest in this, especially on Apple hardware. I am beginning to learn about the constrained generation used with their on-device foundation model. This might be good content for you as well.

Sebastian Raschka, PhD:

Interesting! I heard they announced API access to their on-device models in iOS 26 at their recent WWDC conference. Do you know if they said something about macOS 26 and opening the API up there as well?

Alpha Xiao:

Thanks for sharing! I have two questions:

1) For models running on CUDA devices, does this KV cache technique still apply? If so, does the cache use GPU or CPU memory?

2) Does the KV cache work across different inference sessions, e.g., so that the system prompt can reuse the cache?

Sebastian Raschka, PhD:

Yes, this works for CUDA as well. You can initialize the placeholder tensors directly on the CUDA device for optimal efficiency, so the cache lives in GPU memory. For this small demo there was not much of a speed benefit, but I'll add the KV cache to my Llama 3 and Qwen3 from-scratch models this weekend, so the difference will be more visible.
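A minimal sketch (illustrative shapes and helper names, not the article's code) of preallocating the KV cache directly on the CUDA device, so decoding only writes into GPU memory:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Preallocate the full cache once on the target device to avoid repeated
# allocations and host-to-device copies during decoding (shapes illustrative).
batch_size, num_heads, max_seq_len, head_dim = 1, 12, 1024, 64
k_cache = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
v_cache = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
cache_len = 0

def update_cache(k_new, v_new):
    # k_new, v_new: (batch_size, num_heads, num_new_tokens, head_dim), already on `device`
    global cache_len
    num_new = k_new.shape[2]
    k_cache[:, :, cache_len:cache_len + num_new] = k_new
    v_cache[:, :, cache_len:cache_len + num_new] = v_new
    cache_len += num_new

update_cache(torch.randn(1, 12, 5, 64, device=device),   # prefill with 5 tokens
             torch.randn(1, 12, 5, 64, device=device))
print(k_cache[:, :, :cache_len].shape)  # torch.Size([1, 12, 5, 64])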

prasadraje:

Great article as always Sebastian, clearly spelling out the KV cache benefits. To illustrate this further: here is a compute analysis spreadsheet that illustrates the specific KV cache total values and the compute reduction due to KV cache for Deepseek, Llama, Qwen (essentially we get GEMVs instead of GEMMs in the attention and MLP blocks). https://www.linkedin.com/posts/prasadraje_i-am-making-available-for-free-this-llm-compute-activity-7326488598705242112-LlwX

Kartik Ramesh:

Wow, this is exactly what I needed! Thanks, Sebastian!

Logan Thorneloe:

Love this! Thank you for sharing. I hope your recovery is swift.

Halidu Abdulai:

Great tutorial! I have a question about how ambiguous words like "duck" would be treated when they appear multiple times with different meanings.

For example, in the sentence:

"He had to duck when the duck flew at him."

The first "duck" is a verb, and the second is a noun. Since we make use of cached key-value (KV) pairs for previously seen tokens, what happens in this case?

If we cache the first "duck" (the verb), do we simply reuse its KV pair when we encounter the second "duck"? Shouldn't their representations be different due to their distinct roles and contexts?

Sebastian Raschka, PhD:

Good question, and I see what you mean. However, this would not be an issue, since we retrieve the previous keys and values by position, not by token identity, so each occurrence of "duck" has its own cached entries.
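A tiny sketch (illustrative only, random weights) showing that the cache holds one entry per position, so the two occurrences of the same token ID already get different key vectors at the embedding level because of the positional embedding (and, in deeper layers, because of their different contexts):

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d = 50, 8
tok_emb = nn.Embedding(vocab_size, d)
pos_emb = nn.Embedding(16, d)
W_key = nn.Linear(d, d, bias=False)

token_ids = torch.tensor([3, 7, 3])            # token ID 3 ("duck") appears twice
x = tok_emb(token_ids) + pos_emb(torch.arange(3))
k_cache = W_key(x)                             # one cached key per *position*

print(torch.allclose(k_cache[0], k_cache[2]))  # False -- separate cache entries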
