Great article, Sebastian! Thank you for your work!
I really enjoyed the overview! Just a quick note: from section 3.1 onward, some sentences seem to have line breaks that make the text a bit hard to read (I noticed this especially in sections 3.1 and 3.3). Could you please take a quick look when you get a chance?
Thanks! Not sure how that happened, but I was copying back and forth from my local markdown editor, which may have caused this. The back and forth was necessary because, for the first time, I got a "Your post is too long and can't be saved. Please edit it to make it shorter or split part of it into another post." error in Substack :(
Thank you!
Fantastic work, Sebastian! Gives me hope that the solution will be found. I strongly encourage you to review some papers I've written on the subject, as you certainly have the background to understand them:
Substack Archives:
https://chrispwendling.substack.com/archive
And:
http://www.itrac.com/EGM_Document_Index.htm
Thanks for sharing! I can't promise to get to them soon, but I will add them to my reading list and check them out some time.
Thanks again, Sebastian! If you read nothing else, you must read this! https://substack.com/@chrispwendling/note/p-177999426?r=3hffor&utm_source=notes-share-action&utm_medium=web
I came to this and am a little confused:
we now have a quadratic n_heads × d_head in here...
Can you explain a little bit? Thanks!
Good catch, I meant to type d_heads × d_head. Does this address your concern?
It should be "d_head × d_head" (without the s), right?
Could you please update this in the repo file as well?
Thanks & done!
Yes, thanks for your clarification!
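For anyone else who paused on that correction, here is a minimal PyTorch sketch (my own toy example, not code from the article) of a linear-attention-style recurrence, assuming the quadratic term refers to the per-head d_head × d_head recurrent state:

```python
import torch

def linear_attention_step(S, q, k, v):
    # S: (d_head, d_head) recurrent state for a single head
    # q, k, v: (d_head,) query, key, and value for the current token
    S = S + torch.outer(k, v)   # rank-1 update; the state stays d_head x d_head
    out = q @ S                 # read-out for the current token, shape (d_head,)
    return S, out

d_head = 64
S = torch.zeros(d_head, d_head)
for _ in range(10):             # toy sequence of 10 tokens
    q, k, v = (torch.randn(d_head) for _ in range(3))
    S, out = linear_attention_step(S, q, k, v)

print(S.shape)  # torch.Size([64, 64]): quadratic in d_head, independent of sequence length
```

So the memory that grows quadratically is the per-head state (d_head × d_head), not anything tied to the sequence length, which is what the corrected text conveys.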
What are your thoughts on specialized foundation models? An LLM, but with specialized training, guardrails, and a human in the loop?
Regarding specialization: I think that's, in a sense, what's already happening with code models, right?
Could you describe a bit more how the humans interact during training, and what the motivation here is?
We’re entering the post-LLM era. Not “smarter models.” Smarter stacks.
The winners won’t chase bigger benchmarks. They’ll orchestrate smaller systems that think faster, cheaper, and closer to the edge. One model routes. One reasons. One polishes. Together, they outperform giants.
This flips how you build. You stop prompting. You start composing. Every model becomes a teammate, not a tool.
As I wrote in Consultants, real leverage comes from sequencing intelligence, not scaling it.
Big isn’t the future. Interconnected is.
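Purely as a toy illustration of that route / reason / polish idea (every name below is made up for the sketch, not taken from any particular framework):

```python
from typing import Callable

def run_stack(prompt: str,
              route: Callable[[str], str],
              reason: Callable[[str], str],
              polish: Callable[[str], str]) -> str:
    # Hypothetical "model stack": a router picks the specialist prompt,
    # a small specialized model drafts the answer, a lightweight model polishes it.
    specialist_prompt = route(prompt)
    draft = reason(specialist_prompt)
    return polish(draft)

# Toy stand-ins; in practice each callable would wrap a different small model.
answer = run_stack(
    "what is 2 + 2?",
    route=lambda p: f"[math] {p}",
    reason=lambda t: f"Draft answer to {t}: 4",
    polish=lambda d: d.strip(),
)
print(answer)
```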
Thanks for the great article!
I do hope you get the time to write a deep-dive on Mamba & DeltaNet architectures someday. Would love to learn more deeply about them!
It was a great read, Sebastian! I really learnt so much about what's going on in this space and about alternatives to standard LLMs, explained in such a lucid way! Definitely excited to see more articles about each of these sections in detail...
Thank you for the overview!
A small note: the link to the "Diffusion‑LM Improves Controllable Text Generation" paper in 3.1 leads to another paper. The correct one is probably this: https://arxiv.org/abs/2205.14217
Sometimes I feel like a hermit on a rock, just sitting and waiting for someone to bring me the news that they've finally solved SSMs as a paradigm and are going to come and blast us past the now-generally-accepted inherent limitations of regular transformers.
An excellent way to walk home. Thanks!