Recent Developments in LLM Architectures: KV…

Sebastian Raschka, PhD

May 16

From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

19 Comments

Szymon Palucha

Jun 8

Thanks for another great article!

I think there might be a few small typos around the figure numbers, e.g.

```

The convolutional mixing part is not shown in Figure 12, because it would have been too crammed, but it’s relatively straightforward.

As hinted at in Figure 12,

```

I think these should maybe refer to Figure 13

and then below

```

Next to the sequence mixing shown in Figure 13, there is also a channel mixing component.

```

should be Figure 14?

Reply (1)

Sebastian Raschka, PhD

Jun 8

Good catch. I must have forgotten to bump the numbers there when I added an additional figure. Fixed it. Thanks!

Denezio

May 25

Thanks for the article

I really hope you cover some new "interesting" training recipes as well in one of your future posts.

Reply (1)

Sebastian Raschka, PhD

May 25

Yes, it’s about time… Training is a harder topic though because of the $$$ required for doing something interesting plus many details are kept secret. But yeah, interesting topic!

TWei

May 23

Do you plan to do a deep dive into MTP (Qwen3.6 and Gemma 4, for instance)?

Reply (1)

Sebastian Raschka, PhD

May 24

Thanks for the recommendation. I covered MTP briefly in the past, and it might indeed be worth revisiting. I can't promise anything yet, though, since there are some other things on my list as well.

Manish Prakash

May 20

These architecture gains matter more for agents than for chat demos. Once an agent keeps rereading docs, diffs, tests, and logs, long-context efficiency stops being a benchmark detail and starts becoming product UX. Cheaper context is basically more room for reasoning plus verification.

Reply (1)

Sebastian Raschka, PhD

May 21

Yes, exactly, agents are very demanding when it comes to long contexts.

Mhmd

May 16

thanks

Think AI

May 16 (Edited)

Sebastian, this is a great reminder that LLM progress is not always visible from the chat interface. A model may look slightly better to users, but under the hood, changes in attention, MoE routing, KV cache efficiency, and long-context design can make a huge difference in cost, speed, and usability. The architecture layer is where a lot of the real competition is happening.

Joseph Liaw

May 16

The clearest explanation of CSA/HCA I have ever seen, very easy to follow!

Reply (1)

Sebastian Raschka, PhD

May 16

Thanks, glad to hear the visualizations helped!

Sahil Maheshwari

May 16

Crazy, will need to check this out later.

Emanuel Maceira

Jul 2

Exceptional technical survey — this is required reading for anyone designing inference systems in 2026.

I want to add a dimension that I think is underrepresented in the conversation: these KV cache architecture decisions are not just cloud inference optimizations. They are, arguably, even more consequential at the edge and on-device level — and the specific techniques covered here map directly onto what makes or breaks an edge AI deployment.

On the KV sharing / cross-layer reuse front (Gemma 4’s ~50% KV reduction): for devices with 4-8GB LPDDR5 shared across CPU, NPU, and radio subsystems, halving the KV footprint at 128K context is the difference between a model that fits in memory and one that doesn’t. On edge hardware there is no graceful degradation through swapping — you either fit or you don’t.
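To make the fit-or-don't-fit arithmetic concrete, here is a rough back-of-the-envelope sketch. The configuration below is hypothetical and does not describe Gemma 4 or any other specific model; it also assumes every layer uses full (global) attention and stores 16-bit values, which real models with sliding-window layers or quantized caches do not.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_value=2, shared_fraction=0.0):
    """Estimate the KV cache size in bytes for a single sequence.

    shared_fraction is the fraction of layers that reuse another layer's
    keys/values instead of storing their own (an illustrative knob, not a
    parameter of any real library).
    """
    layers_storing_kv = round(num_layers * (1.0 - shared_fraction))
    # Factor 2: one key tensor and one value tensor per storing layer.
    return (2 * layers_storing_kv * num_kv_heads * head_dim
            * context_len * bytes_per_value)


# Hypothetical small model at a 128K-token context.
cfg = dict(num_layers=16, num_kv_heads=4, head_dim=128,
           context_len=128 * 1024)

full = kv_cache_bytes(**cfg)
shared = kv_cache_bytes(**cfg, shared_fraction=0.5)

print(f"no sharing:  {full / 2**30:.1f} GiB")
print(f"50% sharing: {shared / 2**30:.1f} GiB")
```

Under these assumptions the cache drops from 4 GiB to 2 GiB, which on a device with 4-8GB of shared memory is roughly the gap between leaving room for the weights and not.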

The ZAYA1-8B CCA approach (convolutional mixing in compressed latent space, reducing both KV size and prefill FLOPs) is particularly relevant for edge because it targets the prefill phase — which on mobile silicon with limited parallelism is where most of the latency cost lives. The decode phase at 1-4B parameter scale is already fast enough on a good NPU; it’s the prefill cost that stalls the user experience.

HCA’s 128-to-1 token summarization into dense KV at 1M context (DeepSeek V4) is fascinating from an edge perspective not because edge devices will run 1M-context models today, but because the underlying technique — hierarchical summarization of context into compressed representations — is exactly the architecture pattern needed for edge devices that need to process long sensor histories (weeks of telemetry, ambient audio context, message logs) without the KV cache growing unboundedly.
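As a toy illustration of why block-wise summarization bounds memory growth, the sketch below collapses every 128 history entries into one summary vector. Mean pooling is used here purely as a stand-in; it is not how HCA in DeepSeek V4 computes its summaries, and the function name and data are made up for the example.

```python
def compress_blocks(vectors, block_size=128):
    """Replace each run of block_size vectors with their element-wise mean.

    A stand-in for a learned summarizer: the point is only that the number
    of stored entries shrinks by roughly a factor of block_size.
    """
    summaries = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        summaries.append(
            [sum(v[i] for v in block) / len(block) for i in range(dim)]
        )
    return summaries


# Hypothetical "sensor history": 1024 two-dimensional readings.
history = [[float(t), 1.0] for t in range(1024)]
summary = compress_blocks(history)

print(len(history), "entries ->", len(summary), "summaries")
```

With a 128-to-1 ratio, 1024 entries shrink to 8, so a history that keeps growing costs 128 times less storage per added reading than keeping every entry.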

Scenarica’s point about adaptive attention budget is important. The edge deployment challenge is that the budget needs to adapt not just to context type but to real-time hardware state — thermal headroom, battery level, available memory bandwidth. An architecture that can dial between KV sharing (cheap, lossy) and full attention (expensive, complete) at inference time, based on device state, would be genuinely transformative for always-on edge AI. That’s the next frontier.

Jakob Ehe

May 27

KV sharing is essentially the architecture discovering that some of what it stored was redundant, and compressing it away. Shannon would have called that removing redundancy you didn't need. The deeper pattern here is that every generation of compute constraints forces the field to rediscover information theory from the bottom up. Curious whether you see the attention budget optimisations converging toward something like a channel capacity limit, a Shannon bound for transformers.

Devansh

May 23

Not super related to the article, but are you tracking any other fields of AI research, aside from the models themselves? I've taken a huge interest in ontologies and storage as a way to cut down the inference costs of agentic systems. And hardware. Curious what's caught your interest these days.

Scenarica

May 17

The convergence across all four architectures toward KV cache reduction tells you where the real constraint in the industry has moved. Training compute was the bottleneck for the 2020-2024 era. Inference efficiency at long context is the bottleneck for the agent era. Every architecture in this piece is solving the same problem from a different angle: how to keep attention costs sublinear as context windows expand into the hundreds of thousands of tokens that agentic workflows demand.

The interesting question is whether these are converging on a single optimal solution or whether different deployment scenarios will favour different approaches. KV sharing suits layers with redundant representations. Compressed attention suits contexts with exploitable structure. In production agent systems where context length and structure vary enormously across tasks, the winning architecture might be the one that dynamically selects between these techniques per layer, per inference pass, rather than committing to a single strategy at training time. That's a harder engineering problem than any individual approach here, but it's probably where the field converges: an adaptive attention budget that matches the efficiency method to the context type in real time.

Reply (1)

The Synthesis

May 23

Laguna XS.2's layer-wise attention budgeting is already a step toward the per-layer selection you're describing, just static rather than chosen at runtime. The harder gap shows up in deployment: Google's TurboQuant hit 6x KV-cache compression in the lab but only https://thesynthesisai.substack.com/p/the-demand-destroyer. That spread is where "structure varies enormously across tasks" stops being a feature you exploit and starts being a tax you pay.

Takaki Ishibashi

May 17

Thank you for the article!