18 Comments
Szymon Palucha's avatar

Thanks for another great article!

I think there might be a few small typos around the figure numbers, e.g.

```

The convolutional mixing part is not shown in Figure 12, because it would have been too crammed, but it’s relatively straightforward.

As hinted at in Figure 12,

```

I think this should maybe refer to Figure 13

and then below

```

Next to the sequence mixing shown in Figure 13, there is also a channel mixing component.

```

should be Figure 14?

Sebastian Raschka, PhD's avatar

Good catch. I must have forgotten to bump the numbers there when I added an additional figure. Fixed it. Thanks!

Denezio's avatar

Thanks for the article

I really hope you cover some new "interesting" training recipes as well in one of your future posts

Sebastian Raschka, PhD's avatar

Yes, it’s about time… Training is a harder topic though because of the $$$ required for doing something interesting plus many details are kept secret. But yeah, interesting topic!

TWei's avatar

Do you plan to do a deep dive in MTPs (qwen3.6 and Gemma4 for instance)?

Sebastian Raschka, PhD's avatar

Thanks for the recommendation. I covered MTP briefly in the past, and it might indeed be worth revisiting. Can’t promise yet though, since there are some other things on my list as well.

Manish Prakash's avatar

These architecture gains matter more for agents than for chat demos. Once an agent keeps rereading docs, diffs, tests, and logs, long-context efficiency stops being a benchmark detail and starts becoming product UX. Cheaper context is basically more room for reasoning plus verification.

Sebastian Raschka, PhD's avatar

Yes, exactly, agents are very demanding when it comes to long context.

Mhmd's avatar

thanks

Think AI's avatar

Sebastian, this is a great reminder that LLM progress is not always visible from the chat interface. A model may look slightly better to users, but under the hood, changes in attention, MoE routing, KV cache efficiency, and long-context design can make a huge difference in cost, speed, and usability. The architecture layer is where a lot of the real competition is happening.

Joseph Liaw's avatar

The clearest explanation of CSA/HCA I have ever seen, very easy to follow!

Sebastian Raschka, PhD's avatar

Thanks, glad to hear the visualizations helped!

Sahil Maheshwari's avatar

Crazy, will need to check this out later.

Jakob Ehe's avatar

KV sharing is essentially the architecture discovering that some of what it stored was redundant — and compressing it away. Shannon would have called that removing entropy you didn't need. The deeper pattern here is that every generation of compute constraints forces the field to rediscover information theory from the bottom up. Curious whether you see the attention budget optimisations converging toward something like a channel capacity limit — a Shannon bound for transformers.

Devansh's avatar

Not super related to the article, but are you tracking any other fields of AI research, aside from the models themselves? I've taken a huge interest in ontologies and storage as a way to cut down the inference costs of agentic systems. And hardware. Curious what's caught your interest these days.

Scenarica's avatar

The convergence across all four architectures toward KV cache reduction tells you where the real constraint in the industry has moved. Training compute was the bottleneck for the 2020-2024 era. Inference efficiency at long context is the bottleneck for the agent era. Every architecture in this piece is solving the same problem from a different angle: how to keep attention costs sublinear as context windows expand into the hundreds of thousands of tokens that agentic workflows demand.

The interesting question is whether these are converging on a single optimal solution or whether different deployment scenarios will favour different approaches. KV sharing suits layers with redundant representations. Compressed attention suits contexts with exploitable structure. In production agent systems where context length and structure vary enormously across tasks, the winning architecture might be the one that dynamically selects between these techniques per layer per inference pass rather than committing to a single strategy at training time. That's a harder engineering problem than any individual approach here, but it's probably where the field converges: an adaptive attention budget that allocates an efficiency method to each context type in real time.

The Synthesis's avatar

Laguna XS.2's layer-wise attention budgeting is already a step toward the per-layer selection you're describing, just static rather than chosen at runtime. The harder gap shows up in deployment: Google's TurboQuant hit 6x KV-cache compression in the lab but only https://thesynthesisai.substack.com/p/the-demand-destroyer. That spread is where "structure varies enormously across tasks" stops being a feature you exploit and starts being a tax you pay.

Takaki Ishibashi's avatar

Thanks for the article!