35 Comments
Leo Benaharon:

Amazing article! This is evidence that we haven't hit a wall yet with LLMs, since all these labs haven't converged on the same architecture.

Cohere Labs is also doing some great work for open source and has some interesting research. I feel a lot of people don't know who they are because they are focused on appealing to businesses/governments.

Sebastian Raschka, PhD:

Good point. Cohere flies a bit under the radar in open-weight LLM circles, maybe because of the enterprise focus that you mentioned. I think their Command model is also > 1.5 years old now and more RAG-focused, so I didn't include it (but please correct me if I'm wrong).

Daniel Kleine:

Apart from the architectural differences, what would be interesting to know is which text data the LLMs have been trained on. From my pov, it's really unfortunate that this info is typically not disclosed, even for open-source LLMs. Not just the amount of training data (e.g., number of tokens) but also the data quality matters for a true scientific comparison.

Sebastian Raschka, PhD:

Agreed. I summarized some of the approaches last year (https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training), but it's tricky due to the lack of full disclosure. Btw, on that note, I recently stumbled upon https://libremodel.xyz/, which aims to be transparent in that regard. It's not going to be a SOTA model, but still interesting.

Daniel Kleine:

With a budget of under $1,000, wow!

I found the recent SmolLM3 data and training procedure quite interesting: https://huggingface.co/blog/smollm3#data-mixture-and-training-stages

Lirio:

Amazing article!

Paul T:

> MLA is a clever trick to reduce KV cache memory use while even slightly outperforming MHA in terms of modeling performance.

What’s the intuition for why MLA improves performance? It seems that compressing would, if anything, be slightly lossy.

Is this just a non-statistically-significant result, and should we say “no evidence that it’s worse”? Or is there some mechanical reason why it is indeed better?

Sebastian Raschka, PhD:

Good question. This is based on the results shown in Figure 4. You are right, you’d expect slightly worse performance since it’s a workaround/approximation. My guess is that the additional “layer” adds more expressiveness.
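
Roughly, the keys and values are reconstructed from a small cached latent via extra up-projections, and those up-projections are the additional “layer” I mean. Below is a minimal sketch of that compression path (illustrative only, not DeepSeek’s actual implementation; the class name, dimensions, and weight names are made up):

```python
import torch
import torch.nn as nn


class MLAStyleKVCompression(nn.Module):
    """Minimal sketch of MLA-style KV compression (illustration only)."""

    def __init__(self, d_model=4096, d_latent=512, d_head=128, n_heads=32):
        super().__init__()
        # Down-project hidden states into a small latent; this latent is what
        # would be cached instead of the full per-head keys and values
        self.W_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Extra up-projections reconstruct K and V from the latent; these are
        # the additional learned mappings that may add expressiveness
        self.W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, x):
        c_kv = self.W_down_kv(x)  # (batch, seq, d_latent) -> this is what gets cached
        k = self.W_up_k(c_kv)     # reconstructed keys, (batch, seq, n_heads * d_head)
        v = self.W_up_v(c_kv)     # reconstructed values
        return k, v, c_kv


x = torch.randn(1, 8, 4096)
k, v, c_kv = MLAStyleKVCompression()(x)
print(k.shape, v.shape, c_kv.shape)
```

Only c_kv would be stored in the KV cache, which is where the memory savings come from.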

Daniel Kleine:

Great overview!

As a small side note, I noticed that in Fig. 4, the bottom left comment appears to read 'MQA.' Should this perhaps be 'MLA' instead?

Sebastian Raschka, PhD:

Good catch, thanks!

Daniel Kleine:

I might have found another small nitpick: In Fig. 19, on the right side, the comment for the intermediate layer dimension should be 1536 (as it must be divisible by 8). Would you mind checking if this is correct?

Sebastian Raschka, PhD:

Thanks, that should indeed be 1536, not 1535 (I need to start wearing glasses).

Daniel Kleine:

:D Thanks! Could you please also update Fig. 1 for Qwen3?

I have also noticed that the arrows on the left side of Fig. 10 seem a bit misaligned. Would you mind taking a look to see if they could be adjusted for clarity?

BTW, I really like the visual comparisons of the architectures!

Sebastian Raschka, PhD:

👍👍

Daniel Kleine:

Thanks!

Paolo Perrone:

this is fantastic

rizkhi_33:

nice

active_sky:

Are you interested in explaining the principles of RoPE (which has almost become a de facto standard in every architecture mentioned in your articles)? The mathematical derivation process is really giving me a headache.😭

Sebastian Raschka, PhD:

One day! I have a long list of things I want to do but only so much time 😅
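
In the meantime, the core operation is simpler than the derivation suggests: each pair of query/key channels is rotated by an angle that grows with the token position, so the attention dot product ends up depending on relative distances. A minimal sketch (illustrative only; the function name is made up, and real implementations cache the cos/sin tables):

```python
import torch


def rope_rotate(x, base=10000.0):
    """Apply RoPE to x of shape (batch, seq_len, head_dim); head_dim must be even."""
    batch, seq_len, head_dim = x.shape
    # One rotation frequency per channel pair, decaying geometrically
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]        # split channels into pairs
    out = torch.empty_like(x)
    # 2D rotation of each pair by its position-dependent angle
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


q = torch.randn(1, 16, 64)
print(rope_rotate(q).shape)  # torch.Size([1, 16, 64])
```

Applying the same rotation to queries and keys before the attention dot product is all there is to it mechanically; the math mostly justifies why this encodes relative positions.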

active_sky:

Thank you for your reply!

me:

Are you sure that the Qwen MoE has a dense MLP in every other layer?

Sebastian Raschka, PhD:

Good callout. That was from something else. Afaik, they don't have dense MLP layers.
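
For intuition, a sparse MoE feed-forward used in every transformer block (rather than alternating with dense MLPs) might look like this minimal sketch; the router, expert sizes, and top-k value here are made up and are not Qwen3's actual configuration:

```python
import torch
import torch.nn as nn


class TinyMoEFeedForward(nn.Module):
    """Minimal top-k routed MoE feed-forward (illustration only)."""

    def __init__(self, d_model=64, d_hidden=128, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (batch, seq, d_model)
        scores = self.router(x)                      # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Naive routing loop: each token's output is a weighted sum of its top-k experts
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).float()
                out = out + mask * weights[..., k : k + 1] * expert(x)
        return out


x = torch.randn(2, 5, 64)
print(TinyMoEFeedForward()(x).shape)  # torch.Size([2, 5, 64])
```

In an architecture without dense MLP layers, every block would use a module like this in place of the usual feed-forward.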

vishal chauhan:

Great insight! I think the biggest leap forward will come from architectural changes that eliminate the auto-regressive nature of current LLMs. Curious to see who comes up with clever solutions to break that paradigm.

Dante:

I appreciate all the work you are putting into compiling this information.

James Zhang:

I’m still reading through, but I’m already amazed. How has your recovery from the back injury been going?

Sebastian Raschka, PhD:

Glad you are liking it! I am already doing much better but yeah, it's still quite a journey. I basically adapted my setup for writing and coding (doing it lying down with a split keyboard). Compared to how it was a few months ago, I can't complain, though!

Nimish Sanghi:

Amazing article, densely packed and yet easy to read. It covers all the popular variations and optimization tricks.

Sebastian Raschka, PhD:

Thanks!!

Devansh:

Another masterpiece, Seb. Would you be interested in guest posting this on my newsletter? You don't have to do much; if you have a Google Doc, I can copy over the article as-is and then write a quick introduction about you / drop your links.

Another question: what are your thoughts on what the next generation of models will look like? As you said, things have been surprisingly similar. Any bets on which aspects of the LLM get reimagined first? My bet has been on the rise of diffusion models. Would love to hear from you.

Daniel:

I think future models won’t choose between these architectures — they’ll merge them. Monoliths for behavioral reliability, MoE for cost, and SSMs to stretch context length. Architecture isn’t a battle of schools — it’s an engineering compromise.

As described here:

https://trilogyai.substack.com/p/ai-ping-pong

or here

https://trilogyai.substack.com/p/ai-discovery

Sebastian Raschka, PhD:

Hm, but one monolithic model that performs all kinds of tasks well (a jack of all trades) will never do as well as the same monolithic model fine- and preference-tuned on a subarea (like code). For companies that want to provide the best user experience and performance in a certain field, it does make sense to specialize the model.

Sebastian Raschka, PhD:

The LoRA Learns Less But Forgets Less paper (https://arxiv.org/abs/2405.09673) shows this nicely for both LoRA and regular training: Training more on math makes the model worse on code problems, and training more on code problems makes the model worse at math.

Rikeijin:

The font in the first figure is so small that it is difficult to read.

Sebastian Raschka, PhD:

Thanks for the feedback. I planned to have this figure only as an overview or preview of what's ahead (not to be read in detail at this point, because it would perhaps be too confusing without additional explanation). But as Daniel Kleine mentioned, you can click on the figure to see a larger version.

Daniel Kleine:

You can click on the figure to display it in full resolution.
