Thanks for another great article!
I think there might be a few small typos around the figure numbers, e.g.
```
The convolutional mixing part is not shown in Figure 12, because it would have been too crammed, but it’s relatively straightforward.
As hinted at in Figure 12,
```
I think these should maybe refer to Figure 13,
and then below
```
Next to the sequence mixing shown in Figure 13, there is also a channel mixing component.
```
should be Figure 14?
Good catch. I must have forgotten to bump the numbers there when I added an additional figure. Fixed it. Thanks!
Thanks for the article
I really hope you cover some new "interesting" training recipes as well in one of your future posts
Yes, it’s about time… Training is a harder topic though because of the $$$ required for doing something interesting plus many details are kept secret. But yeah, interesting topic!
Do you plan to do a deep dive into MTP (Qwen3.6 and Gemma4, for instance)?
Thanks for the recommendation. I covered MTP briefly in the past, and it might indeed be worth revisiting. Can’t promise anything yet, though, since there are some other things on my list as well.
These architecture gains matter more for agents than for chat demos. Once an agent keeps rereading docs, diffs, tests, and logs, long-context efficiency stops being a benchmark detail and starts becoming product UX. Cheaper context is basically more room for reasoning plus verification.
Yes, exactly, agents are very demanding when it comes to long contexts.
thanks
Sebastian, this is a great reminder that LLM progress is not always visible from the chat interface. A model may look slightly better to users, but under the hood, changes in attention, MoE routing, KV cache efficiency, and long-context design can make a huge difference in cost, speed, and usability. The architecture layer is where a lot of the real competition is happening.
The clearest explanation of CSA/HCA I have ever seen, very easy to follow!
Thanks, glad to hear the visualizations helped!
Crazy, will need to check this out later.
KV sharing is essentially the architecture discovering that some of what it stored was redundant — and compressing it away. Shannon would have called that removing entropy you didn't need. The deeper pattern here is that every generation of compute constraints forces the field to rediscover information theory from the bottom up. Curious whether you see the attention budget optimisations converging toward something like a channel capacity limit — a Shannon bound for transformers.
Not super related to the article, but are you tracking any other fields of AI research aside from the models themselves? I've taken a huge interest in ontologies and storage as a way to cut down the inference costs of agentic systems. And hardware. Curious what's caught your interest these days.
The convergence across all four architectures toward KV cache reduction tells you where the real constraint in the industry has moved. Training compute was the bottleneck for the 2020-2024 era. Inference efficiency at long context is the bottleneck for the agent era. Every architecture in this piece is solving the same problem from a different angle: how to keep attention costs sublinear as context windows expand into the hundreds of thousands of tokens that agentic workflows demand.
The interesting question is whether these are converging on a single optimal solution or whether different deployment scenarios will favour different approaches. KV sharing suits layers with redundant representations. Compressed attention suits contexts with exploitable structure. In production agent systems, where context length and structure vary enormously across tasks, the winning architecture might be the one that dynamically selects between these techniques per layer, per inference pass, rather than committing to a single strategy at training time. That's a harder engineering problem than any individual approach here, but it's probably where the field converges: an adaptive attention budget that allocates an efficiency method to each context type in real time.
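For a rough sense of why the KV cache becomes the constraint at agent-scale context lengths, here is a back-of-the-envelope sketch. It uses the standard estimate for a plain (uncompressed) KV cache: two tensors (keys and values) per caching layer, each of size KV heads × head dimension × sequence length. The model configuration and the `kv_layer_fraction` knob are made up for illustration and do not describe any architecture from the article; the knob only approximates cross-layer KV sharing by letting a fraction of layers store their own cache.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, kv_layer_fraction=1.0):
    """Approximate KV-cache size in bytes for a single sequence.

    bytes_per_elem=2 assumes 16-bit cache entries (e.g. bf16).
    kv_layer_fraction is a hypothetical knob: the fraction of layers that
    store their own K/V; the remaining layers are assumed to reuse them.
    """
    layers_with_cache = n_layers * kv_layer_fraction
    # Factor 2: one tensor for keys and one for values per caching layer.
    return int(2 * layers_with_cache * n_kv_heads * head_dim
               * seq_len * bytes_per_elem)


# Hypothetical model config, chosen only to give round numbers.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    full = kv_cache_bytes(seq_len, **cfg)
    shared = kv_cache_bytes(seq_len, kv_layer_fraction=0.5, **cfg)
    print(f"{seq_len:>7} tokens: "
          f"full={full / 2**30:.2f} GiB, shared={shared / 2**30:.2f} GiB")
```

With these assumed numbers, the cache is 128 KiB per token, so it grows from 1 GiB at 8K tokens to 16 GiB at 128K tokens for a single sequence. Halving the number of caching layers halves that, but the growth is still linear in context length, which is why the compression and sharing tricks get combined rather than used alone.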
Laguna XS.2's layer-wise attention budgeting is already a step toward the per-layer selection you're describing, just static rather than chosen at runtime. The harder gap shows up in deployment: Google's TurboQuant hit 6x KV-cache compression in the lab but only https://thesynthesisai.substack.com/p/the-demand-destroyer. That spread is where "structure varies enormously across tasks" stops being a feature you exploit and starts being a tax you pay.
Thanks for the article!