Apart from the architectural differences, what would also be interesting to know is which text data the LLMs have been trained on. From my point of view, it's really unfortunate that this info is typically not disclosed, even for open-source LLMs. Not just the amount of training data (e.g., the number of tokens) but also the data quality matters for a true scientific comparison.
Agreed. I summarized some of the approaches last year (https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training), but it's tricky due to the lack of full disclosure. Btw, on that note, I recently stumbled upon https://libremodel.xyz/, which aims to be transparent in that regard. It's not going to be a SOTA model, but it's still interesting.
With a budget of under $1,000, wow!
I found the recent SmolLM3 data and training procedure quite interesting: https://huggingface.co/blog/smollm3#data-mixture-and-training-stages
Amazing article! This is evidence that we haven't hit a wall with LLMs yet, as these labs haven't all converged on the same architecture.
Cohere Labs is also doing some great work for open source and has some interesting research. I feel a lot of people don't know who they are because they are mainly trying to appeal to businesses/governments.
Good point. Cohere flies a bit under the radar in open-weight LLM circles, maybe because of the enterprise focus that you mentioned. I think their Command model is also >1.5 years old now and more RAG-focused, so I didn't include it (but please correct me if I'm wrong).
Really amazing and very helpful!
Thank you so much for consistently sharing such valuable articles.
One more thing: I must tell the whole world that your book, Build a Large Language Model (From Scratch), is definitely worth a read! It was the true catalyst that sparked my journey into the LLM field!
Thanks for the kind words and for kindly recommending my book! It’s awesome to hear that it’s been a career-starter!
Amazing article!
I appreciate all the work you are putting into compiling this information.
> MLA is a clever trick to reduce KV cache memory use while even slightly outperforming MHA in terms of modeling performance.
What’s the intuition for why MLA improves performance? It seems that the compression would, if anything, be slightly lossy.
Is this just a result that isn’t statistically significant, so we should say “no evidence that it’s worse”? Or is there some mechanistic reason it is indeed better?
Good question. This is based on the results shown in Figure 4. You are right, you’d expect slightly worse performance since it’s a workaround/approximation. My guess is that the additional “layer” adds more expressiveness.
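For anyone who wants to see the intuition in code, here's a minimal sketch of the low-rank KV compression behind MLA (simplified: it omits RoPE handling and query compression, so it's not the actual DeepSeek implementation, and the dimensions are made-up placeholders):

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    # Sketch only: keys/values are reconstructed from a small latent vector,
    # so only that latent (d_latent per token) needs to be kept in the KV cache
    # instead of the full-size keys and values.
    def __init__(self, d_model=1024, n_heads=16, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_dkv = nn.Linear(d_model, d_latent, bias=False)  # down-projection -> cached
        self.W_uk = nn.Linear(d_latent, d_model, bias=False)   # up-projection to keys
        self.W_uv = nn.Linear(d_latent, d_model, bias=False)   # up-projection to values
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        c_kv = self.W_dkv(x)  # this is what the KV cache would store
        q, k, v = self.W_q(x), self.W_uk(c_kv), self.W_uv(c_kv)

        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        out = nn.functional.scaled_dot_product_attention(
            split(q), split(k), split(v), is_causal=True)
        return self.W_o(out.transpose(1, 2).reshape(b, t, -1))
```

The extra down/up projections are also the additional “layer” I mentioned: compared to plain MHA, the keys and values pass through two linear transformations instead of one.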
Great overview!
As a small side note, I noticed that in Fig. 4, the bottom left comment appears to read 'MQA.' Should this perhaps be 'MLA' instead?
Good catch, thanks!
I might have found another small nitpick: In Fig. 19, on the right side, the comment for the intermediate layer dimension should be 1536 (as it must be divisible by 8). Would you mind checking if this is correct?
Thanks, that should indeed be 1536, not 1535 (I need to start wearing glasses).
:D Thanks! Could you please also update Fig. 1 for Qwen3?
I have also noticed that the arrows on the left side of Fig. 10 seem a bit misaligned. Would you mind taking a look to see if they could be adjusted for clarity?
BTW, I really like the visual comparisons of the architectures!
👍👍
Thanks!
Thanks for the article! Just a small correction: Gemma 3 uses GELU, not SiLU, in the feed-forward module (in the figure comparing Gemma 3 27B to Mistral Small 3.1 24B).
Thanks for the note! You are absolutely right; it must have been a copy-and-paste error (I coincidentally had a section on GELU vs. SiLU in my most recent article). Anyway, it should be fixed now!
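For readers curious what the difference actually looks like, here's a minimal sketch of a gated feed-forward module where only the activation changes between the GELU variant (Gemma-style) and the SiLU variant (Mistral/Llama-style); the dimensions are placeholders, not the actual model sizes:

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    # Gated MLP sketch: GELU gating ("GEGLU") vs. SiLU gating ("SwiGLU").
    # Everything else about the block is identical; only `act` differs.
    def __init__(self, d_model=1024, d_ff=4096, activation="gelu"):
        super().__init__()
        self.fc_gate = nn.Linear(d_model, d_ff, bias=False)
        self.fc_up = nn.Linear(d_model, d_ff, bias=False)
        self.fc_down = nn.Linear(d_ff, d_model, bias=False)
        # Gemma uses a tanh-approximated GELU; SiLU is the "swish" in SwiGLU.
        self.act = nn.GELU(approximate="tanh") if activation == "gelu" else nn.SiLU()

    def forward(self, x):
        return self.fc_down(self.act(self.fc_gate(x)) * self.fc_up(x))

x = torch.randn(2, 8, 1024)
print(GatedFeedForward(activation="gelu")(x).shape)  # torch.Size([2, 8, 1024])
print(GatedFeedForward(activation="silu")(x).shape)  # torch.Size([2, 8, 1024])
```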
Thanks. This is insanely useful for study
Amazing article, thanks!!! But I have a question: Qwen3-235B-A22B employs a fully integrated Mixture-of-Experts (MoE) architecture, rather than alternating between dense and MoE layers every other layer.
Thanks for the note, you are absolutely right. This must have been a copy-paste error from the Llama 4 architecture. Just fixed it.
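To make the distinction concrete, here's a rough sketch of the two layer layouts (illustrative only; the layer count and the "every other layer" rule are placeholders, not the exact Qwen3 or Llama 4 configurations):

```python
def layer_types(n_layers: int, layout: str) -> list[str]:
    # "all_moe": every transformer block uses an MoE feed-forward module
    # (the Qwen3-235B-A22B case described above).
    # "alternating": dense and MoE feed-forward modules are interleaved
    # (the Llama 4-style pattern the figure originally showed).
    if layout == "all_moe":
        return ["MoE"] * n_layers
    if layout == "alternating":
        return ["Dense" if i % 2 == 0 else "MoE" for i in range(n_layers)]
    raise ValueError(f"unknown layout: {layout}")

print(layer_types(6, "all_moe"))      # ['MoE', 'MoE', 'MoE', 'MoE', 'MoE', 'MoE']
print(layer_types(6, "alternating"))  # ['Dense', 'MoE', 'Dense', 'MoE', 'Dense', 'MoE']
```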
Very Helpful. Thanks
this is fantastic
nice
Would you be interested in explaining the principles of RoPE (which has almost become the de facto standard in every architecture mentioned in your articles)? The mathematical derivation is really giving me a headache. 😭
One day! I have a long list of things I want to do but so little time 😅
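In the meantime, here's a tiny sketch of the core idea: each pair of query/key channels gets rotated by an angle that grows with the token position, so relative offsets show up as angle differences inside the attention dot products. (Simplified; real implementations differ in how they pair and interleave the channels, and this skips the derivation entirely.)

```python
import torch

def rope(x, base=10000.0):
    # x: (batch, n_heads, seq_len, head_dim); head_dim must be even.
    b, h, t, d = x.shape
    # One rotation frequency per channel pair, decreasing geometrically.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # 2D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 4, 6, 16)
print(rope(q).shape)  # torch.Size([1, 4, 6, 16])
```

Applying the same rotation to both queries and keys is what makes their dot product depend only on the relative position difference, which is the whole point of RoPE.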
Thank you for your reply!
Are you sure that the Qwen MoE has a dense MLP in every other layer?
Good callout. That was from something else. Afaik, they don’t have dense MLP layers.
Great insight! I think the biggest leap forward will come from architectural changes that eliminate the auto-regressive nature of current LLMs. Curious to see who comes up with clever solutions to break that paradigm.