27 Comments
Kai Liu:

Great article, Sebastian! Thank you for your work!

tzc:

Thank you very much for this post!

I have one question about the code in Section 2.5 (gated attention). I thought the gate and the context would have the same shape, but in line 100 the context is (b, num_tokens, self.num_heads, self.head_dim) while the gate is (b, self.num_heads, num_tokens, self.head_dim).

Is this difference intentional, or might it be a small mistake?

Thanks again for sharing!

Sebastian Raschka, PhD:

Thanks, I think you are correct; there shouldn't be a `gate = gate.transpose(1, 2)` there.
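
For readers hitting the same question, here is a minimal sketch of the shapes being discussed (the sizes and the `gate` computation are illustrative assumptions, not the article's exact code):

```python
import torch

# Illustrative sizes (hypothetical, not from the article).
b, num_tokens, num_heads, head_dim = 2, 8, 4, 16

# After attention, the context is brought back to token-major layout:
# (b, num_tokens, num_heads, head_dim).
context = torch.randn(b, num_tokens, num_heads, head_dim)

# The gate is computed per token and per head, so it naturally comes out
# in the same (b, num_tokens, num_heads, head_dim) layout. An extra
# `gate = gate.transpose(1, 2)` would turn it into
# (b, num_heads, num_tokens, head_dim) and break the elementwise product.
gate = torch.randn(b, num_tokens, num_heads, head_dim)

gated_context = context * torch.sigmoid(gate)  # shapes match elementwise
print(gated_context.shape)  # torch.Size([2, 8, 4, 16])
```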

Varun:

Thanks for the great article!

I do hope you get the time to write a deep-dive on Mamba & DeltaNet architectures someday. Would love to learn more deeply about them!

Aishwarya Agrawal:

It was a great read, Sebastian! I really learnt so much about what's going on in this space and about alternatives to standard LLMs, all explained in such a lucid way! Definitely excited to see more articles about each of these sections in detail...

Antonina:

Thank you for the overview!

A small note: the link to the "Diffusion‑LM Improves Controllable Text Generation" paper in Section 3.1 leads to a different paper. The correct one is probably this: https://arxiv.org/abs/2205.14217

Sebastian Raschka, PhD:

Good catch, thanks! Missed this comment earlier and just updated it.

Daniel Kleine:

I really enjoyed the overview! Just a quick note – from section 3.1 onward, some sentences seem to have line breaks that make the text a bit hard to read (I have noticed that especially in sections 3.1 and 3.3). Could you please take a quick look when you get a chance?

Sebastian Raschka, PhD:

Thanks! Not sure how that happened, but I was copying back and forth from my local markdown editor, which may have caused it. The back and forth was because, for the first time, I got a "Your post is too long and can't be saved. Please edit it to make it shorter or split part of it into another post." error in Substack :(

praxis22:

An excellent way to walk home. Thanks!

Chris Wendling:

Fantastic work, Sebastian! Gives me hope that the solution will be found. I strongly encourage you to review some papers I've written on the subject, as you certainly have the background to understand them:

Substack Archives: https://chrispwendling.substack.com/archive

And: http://www.itrac.com/EGM_Document_Index.htm

Sebastian Raschka, PhD:

Thanks for sharing! I can't promise to get to them soon, but I will add them to my reading list and check them out some time.

Kai Liu:

I came to this and am a little confused:

we now have a quadratic n_heads × d_head in here...

can you explain a little bit? Thanks!

Sebastian Raschka, PhD:

Good catch, I meant to type d_heads × d_head. Does this address your concern?

Daniel Kleine:

It should be "d_head × d_head" (without the s), right?

Could you please update this in the repo file as well?

Kai Liu:

Yes, thanks for your clarification!
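
To make the corrected term concrete, here is a minimal sketch of where the d_head × d_head factor comes from in linear-attention variants such as DeltaNet (a generic state update with illustrative names and sizes, not the article's code):

```python
import torch

# Illustrative sizes (hypothetical).
b, num_heads, head_dim = 2, 4, 16

# Linear-attention variants carry a recurrent state S per head of shape
# (head_dim, head_dim); this matrix is the source of the quadratic
# d_head x d_head term (quadratic in head_dim, not in sequence length).
S = torch.zeros(b, num_heads, head_dim, head_dim)

k = torch.randn(b, num_heads, head_dim)  # key for the current token
v = torch.randn(b, num_heads, head_dim)  # value for the current token
q = torch.randn(b, num_heads, head_dim)  # query for the current token

# Simplest outer-product state update (delta-rule variants refine this).
S = S + torch.einsum("bhd,bhe->bhde", k, v)

out = torch.einsum("bhd,bhde->bhe", q, S)  # read out via the state
print(out.shape)  # torch.Size([2, 4, 16])
```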

Ai Therapy Solutions:

What are your thoughts on specialized foundational models? An LLM, but with specialized training, guardrails, and a human in the loop?

Sebastian Raschka, PhD:

Regarding specialization: I think that's, in some sense, already happening with code models, right?

Could you describe a bit more how the humans interact during training, and what the motivation here is?

Mitch Klein:

Check this out. This is beyond standard. New Epistemic system incoming. https://mitchklein.substack.com/p/putting-theory-of-mind-to-work?r=b162

siyu:

Looks like you have the same (or very similar) code for both "Gated Attention" and "Gated DeltaNet attention". At least the subroutines have the same names, and I'm having trouble seeing the difference between the two implementations.
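
For readers wondering about the distinction itself, here is a rough sketch of how the two techniques usually differ (generic formulations with illustrative names and shapes, not the repository's actual code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (hypothetical).
b, n, h, d = 2, 8, 4, 16
q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)
v = torch.randn(b, h, n, d)
gate = torch.randn(b, h, n, d)

# Gated attention: ordinary softmax attention, followed by a sigmoid
# output gate on the per-head context (causal masking omitted for brevity).
attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
ctx_gated_attn = (attn @ v) * torch.sigmoid(gate)

# Gated DeltaNet: no softmax; a per-head (d x d) state is decayed by a
# gate and updated token by token with a delta-rule correction.
S = torch.zeros(b, h, d, d)
alpha = torch.sigmoid(torch.randn(b, h, n))  # decay gate per token
beta = torch.sigmoid(torch.randn(b, h, n))   # delta-rule step size
outs = []
for t in range(n):
    kt, vt, qt = k[:, :, t], v[:, :, t], q[:, :, t]
    S = alpha[:, :, t, None, None] * S            # gated decay of the state
    pred = torch.einsum("bhd,bhde->bhe", kt, S)   # what S currently predicts
    S = S + beta[:, :, t, None, None] * torch.einsum(
        "bhd,bhe->bhde", kt, vt - pred)           # delta-rule correction
    outs.append(torch.einsum("bhd,bhde->bhe", qt, S))
ctx_gated_deltanet = torch.stack(outs, dim=2)     # (b, h, n, d)
```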

Szymon Palucha:

Another excellent article! Thanks a lot for turning complex content into quick, digestible, and easy-to-understand explanations!

HuangTing:

Does the code for Section 2.5 have some problems?

Ruben Hassid:

We’re entering the post-LLM era. Not “smarter models.” Smarter stacks.

The winners won’t chase bigger benchmarks. They’ll orchestrate smaller systems that think faster, cheaper, and closer to the edge. One model routes. One reasons. One polishes. Together, they outperform giants.

This flips how you build. You stop prompting. You start composing. Every model becomes a teammate, not a tool.

As I wrote in Consultants, real leverage comes from sequencing intelligence, not scaling it.

Big isn’t the future. Interconnected is.

APS:

Sometimes I feel like a hermit on a rock, just sitting and waiting for someone to bring me the news that they've finally solved SSMs as a paradigm and that they're going to come and blast us past the now generally accepted inherent limitations of regular transformers.