This roundup aged beautifully with the Gemma 4 drop this week. The MoE efficiency direction flagged here is exactly where Google landed — their 26B-A4B model activates just 3.8 billion parameters during inference, roughly 15% of total weights. People are already running it locally on laptops via LM Studio, which would've been unthinkable for a model this capable a year ago.
Also, Qwen3-Coder-Next at 3B active params beats DeepSeek V3.2 at 37B active. That's not an incremental improvement; it's an entirely different cost curve. The competitive moat has shifted from "how many parameters can you train" to "how few can you activate per useful token."
Curious whether you think the MoE sparsity gains plateau at some activation ratio, or if we'll keep seeing competitive models push below 15% activation. The inference cost implications are enormous either way.
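For what it's worth, the activation ratios quoted above are easy to sanity-check (figures are the ones from this comment thread, not verified against the model cards):

```python
# Figures as quoted in this thread, not verified against model cards.
gemma_total  = 26e9    # Gemma 4 26B-A4B: total parameters
gemma_active = 3.8e9   # parameters activated per token

ratio = gemma_active / gemma_total
print(f"{ratio:.1%}")  # 14.6%, i.e. "roughly 15% of total weights"

qwen_active     = 3e9   # Qwen3-Coder-Next, active per token
deepseek_active = 37e9  # DeepSeek V3.2, active per token
print(f"{deepseek_active / qwen_active:.1f}x fewer active params")  # 12.3x
```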
Always open to suggestions — growth comes from refining ideas. Curious what directions you're exploring. If you find my content useful, I'd appreciate a subscription and reactions on my posts — it makes a real difference.
Really enjoyed this architectural deep-dive — the side-by-side diagrams are genuinely the clearest way to internalize how much design debt is being paid down across the field right now.
The observation about MiniMax M2.5 sticking with plain GQA was what stood out most to me. There's something almost contrarian about choosing simplicity when everyone else is racing toward hybrid linear attention. I'd be curious whether that translates into easier fine-tuning or more predictable scaling behavior in practice.
The note on Tiny Aya dropping QK-Norm for long-context reasons is also a good reminder that "training stability" and "inference behavior" aren't always aligned goals — would love to see an ablation on that tradeoff somewhere.
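For anyone unfamiliar with QK-Norm: it's an RMSNorm applied to the queries and keys right before the attention logits are computed, which bounds the logit magnitude and helps training stability. A toy numpy sketch (my own illustration, not code from the article):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize each row over the head dimension, as QK-Norm does per head.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attn_logits(q, k, qk_norm=True):
    if qk_norm:  # bounds |logit| by sqrt(head_dim)
        q, k = rms_norm(q), rms_norm(k)
    return q @ k.T / np.sqrt(q.shape[-1])

# Large activations produce extreme logits without QK-Norm:
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64)) * 50
k = rng.standard_normal((4, 64)) * 50
print(np.abs(attn_logits(q, k, qk_norm=False)).max())  # can be in the thousands
print(np.abs(attn_logits(q, k, qk_norm=True)).max())   # <= sqrt(64) = 8
```

The cap follows from Cauchy-Schwarz: RMS-normalized rows have norm at most sqrt(head_dim), so the scaled dot product can't exceed sqrt(head_dim).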
Looking forward to the DeepSeek V4 addition!
There is a typo in Heading 8: Qwen3.5 and the Continutation of Hybrid Attention. It should be Continuation.
Thanks, should be fixed now!
Sebastian, excellent breakdown of open-weight LLM architectures in Jan-Feb 2026—love the inference-time scaling categories. As a coder, inference speed hacks are game-changers for local agents. Which one do you think will dominate vibe coding next? Your takes always ahead! 📈 #AI2026
Thanks! Let me revisit this question once DeepSeek V4 is released 😊
Great post, would love to see the breakdown of Sarvam models once they are out!
There is a typo in Figure 12, it should be 11B instead of 37B parameters are active for the Step 3.5 Flash.
Thanks for the note. Must have been a copy&paste error, but I just fixed it.
The Attention block in Figure 1 is a bit confusing. The right order from top to bottom should be:
Gated Attention
RoPE+NoPE
QKNorm
Also, it shouldn't be 'RoPE+NoPE', since only one of them is used depending on which attention algorithm, like GQA or SWA, is applied. So I feel the correct way to illustrate it would be 'RoPE or NoPE'.
Thanks for the feedback! Regarding
"""
Gated Attention
RoPE+NoPE
QKNorm
"""
Actually I didn't mean to imply a particular order. Originally, when I started drawing these diagrams years ago, I just had a RoPE box there. Then I added QK-Norm for Olmo etc. And then for this one I just added Gated Attention below where there was still space. But I can see that it may be confusing in case a specific order is expected. This is actually a good point, and I will order it in the future.
Regarding RoPE + NoPE, I also agree. The + was shorthand for both being used in the model, but here it can be a bit misleading because you wouldn't use both at the same time; it's either-or.
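To make the either/or concrete, here's a toy numpy sketch (my own illustration, not from the article): a given layer applies either RoPE to the queries/keys or nothing at all (NoPE), never both. It also shows the nice RoPE property that logits depend only on relative positions:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding on x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * freqs[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def attn_logits(q, k, positions, pe="rope"):
    if pe == "rope":  # rotate q and k by position-dependent angles
        q, k = rope(q, positions), rope(k, positions)
    # pe == "nope": no positional encoding at all; q, k used as-is
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
q, k = rng.standard_normal((3, 8)), rng.standard_normal((3, 8))
# RoPE logits are invariant to shifting all positions by a constant:
a = attn_logits(q, k, np.arange(3.0), pe="rope")
b = attn_logits(q, k, np.arange(3.0) + 100, pe="rope")
print(np.allclose(a, b))  # True
```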