24 Comments
Vivek Nayyar:

One of the best articles I have read and so well written. Please continue writing more 🙏

Sebastian Raschka, PhD:

Thanks for the kind words!

Vivek Nayyar:

If possible, please also write about the query, key, and value tensors and the role each of them plays in deciding the next token.

Sebastian Raschka, PhD:

Thanks for the suggestion! I think this may already be covered in my previous attention mechanism article: https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention ?

Michael Xie:

Very well written, clear, sharp, entertaining to read, and very educational. Thank you!

Sebastian Raschka, PhD:

Thanks!

RR:

Great to see you back in action!! I hope your recovery is proceeding well.

Mariano Kamp:

Thank you, Sebastian. Quick question: the sliding window would mean that, alongside the cache, the context window is also truncated, right?

Sebastian Raschka, PhD:

Yes, it would truncate the context.

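To sketch what that looks like in code (a minimal, hypothetical example, not the article's exact implementation): after each update, the cache is trimmed to the most recent `window_size` positions, so older tokens simply fall out of the attendable context.

```python
import torch

def update_sliding_cache(cache_k, cache_v, new_k, new_v, window_size=1024):
    # cache_k, cache_v, new_k, new_v: (batch, num_heads, seq_len, head_dim)
    cache_k = torch.cat([cache_k, new_k], dim=2)
    cache_v = torch.cat([cache_v, new_v], dim=2)
    # Keep only the most recent `window_size` positions; tokens that fall
    # out of the window can no longer be attended to.
    return cache_k[:, :, -window_size:, :], cache_v[:, :, -window_size:, :]
```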
kevin:

Thanks a lot!

Peter van Beek:

Thank you, great tutorial! One question I had about KV caching is why it is not also applied to the query data. It seems to me that if x is the full context in the vanilla implementation, then queries = self.W_query(x) also recomputes many query tokens repeatedly. I must be missing something, but I previously couldn't find a clear answer.

Sebastian Raschka, PhD:

That's because for the current query, you don't need past queries, only past keys and values when computing the attention score.

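To make that concrete, here is a minimal single-head sketch of one decoding step (hypothetical names and shapes, not the article's exact code): only the new token's query is computed, and it attends to all cached keys and values; past queries are never reused, so there is nothing to gain from caching them.

```python
import torch

def decode_step(x_new, W_query, W_key, W_value, cache_k, cache_v):
    # x_new: (1, emb_dim) hidden state of the newest token only
    q_new = x_new @ W_query                       # query for the new token; past queries are not needed
    k_new = x_new @ W_key
    v_new = x_new @ W_value
    cache_k = torch.cat([cache_k, k_new], dim=0)  # (seq_len, head_dim)
    cache_v = torch.cat([cache_v, v_new], dim=0)
    attn_scores = q_new @ cache_k.T / cache_k.shape[-1] ** 0.5
    attn_weights = torch.softmax(attn_scores, dim=-1)
    context = attn_weights @ cache_v              # (1, head_dim)
    return context, cache_k, cache_v
```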
Sahibpreet Singh:

One of the best implementations and explanations of the KV cache.

Sebastian Raschka, PhD:

Thanks!

Scott Gwynn:

Thank you. I have a keen interest in this, especially on Apple hardware. I am beginning to learn about the constrained generation used with their on-device foundation model. This might be good content for you as well.

Sebastian Raschka, PhD:

Interesting! I heard they announced API access to their on-device models in iOS 26 at their recent WWDC conference. Do you know if they said something about macOS 26 and opening the API up there as well?

Alpha Xiao:

Thanks for sharing! I have two questions:

1) For models running on CUDA devices, does this KV cache technique still apply? If so, does the cache use GPU or CPU memory?

2) Does the KV cache work across different inference sessions, e.g., so that the system prompt can utilize the cache?

Sebastian Raschka, PhD:

Yes, this works for CUDA as well. You can initialize placeholder tensors on the CUDA device directly for optimal efficiency. For this small demo there was not much of a speed benefit, but I'll add the KV cache to my Llama 3 and Qwen3 from-scratch models this weekend, so the difference will be more visible.

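For illustration, a minimal sketch of preallocating the cache on the GPU (hypothetical sizes, not the article's exact code):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size, num_heads, max_seq_len, head_dim = 1, 8, 4096, 64

# Preallocate the cache directly in GPU memory; each decoding step then
# writes its new keys/values into the next free slot instead of reallocating.
cache_k = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
cache_v = torch.zeros(batch_size, num_heads, max_seq_len, head_dim, device=device)
```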
prasadraje:

Great article as always, Sebastian, clearly spelling out the KV cache benefits. To illustrate this further, here is a compute-analysis spreadsheet that shows the specific KV cache totals and the compute reduction due to the KV cache for DeepSeek, Llama, and Qwen (essentially, we get GEMVs instead of GEMMs in the attention and MLP blocks): https://www.linkedin.com/posts/prasadraje_i-am-making-available-for-free-this-llm-compute-activity-7326488598705242112-LlwX

Kartik Ramesh:

Wow, this is exactly what I needed! Thanks, Sebastian!

Logan Thorneloe:

Love this! Thank you for sharing. I hope your recovery is swift.

Halidu Abdulai:

Great tutorial! I have a question about how ambiguous words like "duck" are treated when they appear multiple times with different meanings.

For example, in the sentence:

"He had to duck when the duck flew at him."

The first "duck" is a verb, and the second is a noun. Since we make use of cached key-value (KV) pairs for previously seen tokens, what happens in this case?

If we cache the first "duck" (the verb), do we simply reuse its KV pair when we encounter the second "duck"? Shouldn't their representations be different due to their distinct roles and contexts?

Sebastian Raschka, PhD:

Good question, and I see what you mean. However, this would not be an issue, since we retrieve the previous keys and values by position: each occurrence of "duck" gets its own cache entry at its own position, so their keys and values are not shared.

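As a toy illustration (hypothetical code with learned absolute positional embeddings, not the article's exact implementation): the cache is appended per position rather than looked up by token identity, so the two occurrences of "duck" get separate entries whose keys differ, because they are computed from position-dependent (and, in deeper layers, context-dependent) hidden states.

```python
import torch

torch.manual_seed(123)
vocab = {"He": 0, "had": 1, "to": 2, "duck": 3, "when": 4, "the": 5, "flew": 6, "at": 7, "him": 8}
tokens = ["He", "had", "to", "duck", "when", "the", "duck", "flew", "at", "him"]

tok_emb = torch.nn.Embedding(len(vocab), 16)
pos_emb = torch.nn.Embedding(len(tokens), 16)
W_key = torch.nn.Linear(16, 16, bias=False)

cache_k = []
for pos, tok in enumerate(tokens):
    x = tok_emb(torch.tensor(vocab[tok])) + pos_emb(torch.tensor(pos))
    cache_k.append(W_key(x))  # one cache entry per position, not per token id

# The two occurrences of "duck" (positions 3 and 6) have different cached keys:
print(torch.allclose(cache_k[3], cache_k[6]))  # False
```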
Yousra Ibrahim:

Please, I want to become a paid subscriber so I can see your subscriber-only content. Thank you!
