16 Comments
Leo Benaharon

Amazing article! This is evidence that we haven't hit a wall with LLMs yet, as all these labs haven't converged on the same architecture.

Cohere Labs is also doing some great and interesting work for open source. I feel a lot of people don't know who they are because they are trying to appeal to businesses/governments.

Sebastian Raschka, PhD

Good point. Cohere flies a bit under the radar in open-weight LLM circles, maybe because of the enterprise focus that you mentioned. I think their Command model is also >1.5 years old now and more RAG-focused, so I didn't include it (but please correct me if I'm wrong).

Daniel Kleine

Great overview!

As a small side note, I noticed that in Fig. 4, the bottom left comment appears to read 'MQA.' Should this perhaps be 'MLA' instead?

Sebastian Raschka, PhD

Good catch, thanks!

Daniel Kleine

I might have found another small nitpick: In Fig. 19, on the right side, the comment for the intermediate layer dimension should be 1536 (as it must be divisible by 8). Would you mind checking if this is correct?

Sebastian Raschka, PhD

Thanks, that should indeed be 1536, not 1535 (I need to start wearing glasses).

Daniel Kleine

:D Thanks! Could you please also update Fig. 1 for Qwen3?

I have also noticed that the arrows on the left side of Fig. 10 seem a bit misaligned. Would you mind taking a look to see if they could be adjusted for clarity?

BTW, I really like the visual comparisons of the architectures!

Sebastian Raschka, PhD

👍👍

Daniel Kleine

Thanks!

Paul T

> MLA is a clever trick to reduce KV cache memory use while even slightly outperforming MHA in terms of modeling performance.

What’s the intuition for why MLA improves performance? It seems that compression would, if anything, be slightly lossy.

Is this just a result that isn’t statistically significant, in which case we should say “no evidence that it’s worse”? Or is there some mechanical reason that it is indeed better?

Sebastian Raschka, PhD

Good question. This is based on the results shown in Figure 4. You are right, you’d expect slightly worse performance since it’s a workaround/approximation. My guess is that it’s because the additional “layer” adds more expressiveness.
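
To make the mechanism concrete, here is a minimal sketch of the MLA idea in PyTorch. The dimensions are purely illustrative, and this is not DeepSeek's actual implementation (which, among other things, handles RoPE via a separate path): only a small latent vector is cached per token, and keys/values are re-projected from it at attention time.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not DeepSeek's real config)
d_model, d_latent, n_heads, d_head = 1024, 256, 16, 64

W_down = nn.Linear(d_model, d_latent, bias=False)            # compress input -> latent (this is what gets cached)
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> keys
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand latent -> values

x = torch.randn(1, 8, d_model)                               # (batch, seq_len, d_model)
c_kv = W_down(x)                                             # (1, 8, d_latent): the KV cache entry per token
k = W_up_k(c_kv).view(1, 8, n_heads, d_head)                 # reconstructed keys
v = W_up_v(c_kv).view(1, 8, n_heads, d_head)                 # reconstructed values

# Cache cost per token: d_latent = 256 floats instead of
# 2 * n_heads * d_head = 2048 for a standard MHA KV cache (8x smaller here).
```

In this toy setup the cache stores 256 values per token instead of the 2048 an MHA KV cache would need, and the learned down/up projections are the extra “layer” I mentioned above.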

Nimish Sanghi

Amazing article, densely packed and yet easy to read; it covers all the popular variations and optimization tricks.

Sebastian Raschka, PhD

Thanks!!

Devansh

Another masterpiece, Seb. Would you be interested in guest-posting this on my newsletter? You don't have to do much: if you have a Google Doc, I can copy over the article as is and then write a quick introduction about you / drop your links.

Another question: what are your thoughts on what the next generation of models will look like? As you said, things have been surprisingly similar. Any bets on which aspects of the LLM get reimagined first? My bet has been on the rise of diffusion models. Would love to hear from you.

James Zhang

I’m still reading through, but I’m already amazed. How has your recovery from the back injury been going?

Daniel Kleine

Apart from the architectural differences, it would be interesting to know which text data the LLMs have been trained on. From my POV, it's really unfortunate that this info is typically not disclosed, even for open-source LLMs. Not just the amount of training data (e.g., number of tokens) but also the data quality matters for a true scientific comparison.
