27 Comments
Kai Liu:

Great article, Sebastian! Thank you for your work!

tzc:

Thank you very much for this post!

I have one question about the code in Section 2.5 (gated attention). I thought the gate and the context would have the same shape, but in line 100 the context is (b, num_tokens, self.num_heads, self.head_dim) while the gate is (b, self.num_heads, num_tokens, self.head_dim).

Is this difference intentional, or might it be a small mistake?

Thanks again for sharing!

Sebastian Raschka, PhD:

Thanks, I think you are correct; there shouldn't be a `gate = gate.transpose(1, 2)` there.
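
For readers hitting the same question, here is a minimal sketch of the shapes being discussed (the sizes and the `gate` computation are illustrative assumptions, not the article's exact code):

```python
import torch

# Illustrative sizes (hypothetical, not from the article).
b, num_tokens, num_heads, head_dim = 2, 8, 4, 16

# After attention, the context is brought back to token-major layout:
# (b, num_tokens, num_heads, head_dim).
context = torch.randn(b, num_tokens, num_heads, head_dim)

# The gate is computed per token and per head, so it naturally comes out
# in the same (b, num_tokens, num_heads, head_dim) layout. An extra
# `gate = gate.transpose(1, 2)` would turn it into
# (b, num_heads, num_tokens, head_dim) and break the elementwise product.
gate = torch.randn(b, num_tokens, num_heads, head_dim)

gated_context = context * torch.sigmoid(gate)  # shapes match elementwise
print(gated_context.shape)  # torch.Size([2, 8, 4, 16])
```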

Varun:

Thanks for the great article!

I do hope you get the time to write a deep-dive on Mamba & DeltaNet architectures someday. Would love to learn more deeply about them!

Aishwarya Agrawal:

It was a great read, Sebastian! I really learnt so much about what's going on in this space and about alternatives to standard LLMs, all explained in such a lucid way! Definitely excited to see more articles about each of these sections in detail...

Antonina:

Thank you for the overview!

A small note: the link to the "Diffusion‑LM Improves Controllable Text Generation" paper in Section 3.1 leads to a different paper. The correct one is probably this: https://arxiv.org/abs/2205.14217

Sebastian Raschka, PhD:

Good catch, thanks! Missed this comment earlier and just updated it.

Daniel Kleine:

I really enjoyed the overview! Just a quick note – from section 3.1 onward, some sentences seem to have line breaks that make the text a bit hard to read (I have noticed that especially in sections 3.1 and 3.3). Could you please take a quick look when you get a chance?

Sebastian Raschka, PhD:

Thanks! Not sure how that happened, but I was copying back and forth from my local markdown editor, which may have caused it. The back and forth was because, for the first time, I got a "Your post is too long and can't be saved. Please edit it to make it shorter or split part of it into another post." error in Substack :(

praxis22:

An excellent way to walk home. Thanks!

Chris Wendling:

Fantastic work, Sebastian! Gives me hope that the solution will be found. I strongly encourage you to review some papers I've written on the subject, as you certainly have the background to understand them:

Substack Archives: https://chrispwendling.substack.com/archive

And: http://www.itrac.com/EGM_Document_Index.htm

Sebastian Raschka, PhD:

Thanks for sharing! I can't promise to get to them soon, but I will add them to my reading list and check them out some time.

Kai Liu:

I came to this and am a little confused:

we now have a quadratic n_heads × d_head in here...

can you explain a little bit? Thanks!

Sebastian Raschka, PhD:

Good catch, I meant to type d_heads × d_head. Does this address your concern?

Daniel Kleine:

It should be "d_head × d_head" (without the s), right?

Could you please update this in the repo file as well?

Kai Liu:

Yes, thanks for your clarification!
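
To make the corrected term concrete, here is a minimal sketch of where the d_head × d_head factor comes from in linear-attention variants such as DeltaNet (a generic state update with illustrative names and sizes, not the article's code):

```python
import torch

# Illustrative sizes (hypothetical).
b, num_heads, head_dim = 2, 4, 16

# Linear-attention variants carry a recurrent state S per head of shape
# (head_dim, head_dim); this matrix is the source of the quadratic
# d_head x d_head term (quadratic in head_dim, not in sequence length).
S = torch.zeros(b, num_heads, head_dim, head_dim)

k = torch.randn(b, num_heads, head_dim)  # key for the current token
v = torch.randn(b, num_heads, head_dim)  # value for the current token
q = torch.randn(b, num_heads, head_dim)  # query for the current token

# Simplest outer-product state update (delta-rule variants refine this).
S = S + torch.einsum("bhd,bhe->bhde", k, v)

out = torch.einsum("bhd,bhde->bhe", q, S)  # read out via the state
print(out.shape)  # torch.Size([2, 4, 16])
```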

Ai Therapy Solutions:

What are your thoughts on specialized foundational models? An LLM, but with specialized training, guardrails, and a human in the loop?

Sebastian Raschka, PhD:

Regarding specialization: I think that's, in some sense, already happening with code models, right?

Could you describe a bit more how the humans interact during training, and what the motivation here is?

Mitch Klein:

Check this out. This is beyond standard. New Epistemic system incoming. https://mitchklein.substack.com/p/putting-theory-of-mind-to-work?r=b162

siyu:

Looks like you have the same (or very similar) code for both "Gated Attention" and "Gated DeltaNet attention". At least the subroutines have the same names, and I'm having trouble seeing the difference between the two implementations.
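
For readers wondering about the distinction itself, here is a rough sketch of how the two techniques usually differ (generic formulations with illustrative names and shapes, not the repository's actual code):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (hypothetical).
b, n, h, d = 2, 8, 4, 16
q = torch.randn(b, h, n, d)
k = torch.randn(b, h, n, d)
v = torch.randn(b, h, n, d)
gate = torch.randn(b, h, n, d)

# Gated attention: ordinary softmax attention, followed by a sigmoid
# output gate on the per-head context (causal masking omitted for brevity).
attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
ctx_gated_attn = (attn @ v) * torch.sigmoid(gate)

# Gated DeltaNet: no softmax; a per-head (d x d) state is decayed by a
# gate and updated token by token with a delta-rule correction.
S = torch.zeros(b, h, d, d)
alpha = torch.sigmoid(torch.randn(b, h, n))  # decay gate per token
beta = torch.sigmoid(torch.randn(b, h, n))   # delta-rule step size
outs = []
for t in range(n):
    kt, vt, qt = k[:, :, t], v[:, :, t], q[:, :, t]
    S = alpha[:, :, t, None, None] * S            # gated decay of the state
    pred = torch.einsum("bhd,bhde->bhe", kt, S)   # what S currently predicts
    S = S + beta[:, :, t, None, None] * torch.einsum(
        "bhd,bhe->bhde", kt, vt - pred)           # delta-rule correction
    outs.append(torch.einsum("bhd,bhde->bhe", qt, S))
ctx_gated_deltanet = torch.stack(outs, dim=2)     # (b, h, n, d)
```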

Szymon Palucha:

Another excellent article! Thanks a lot for turning complex content into quick, digestible, and easy-to-understand explanations!

HuangTing:

Does the code for Section 2.5 have some problems?

Ruben Hassid:

We’re entering the post-LLM era. Not “smarter models.” Smarter stacks.

The winners won’t chase bigger benchmarks. They’ll orchestrate smaller systems that think faster, cheaper, and closer to the edge. One model routes. One reasons. One polishes. Together, they outperform giants.

This flips how you build. You stop prompting. You start composing. Every model becomes a teammate, not a tool.

As I wrote in Consultants, real leverage comes from sequencing intelligence, not scaling it.

Big isn’t the future. Interconnected is.

APS:

Sometimes I feel like a hermit on a rock, just sitting and waiting for someone to bring me the news that they've finally solved SSMs as a paradigm and that they're going to come and blast us past the now generally accepted inherent limitations of regular transformers.