30 Comments
Dr. Ashish Bamania

Love this article! Quite helpful as always!

Zupo Llask

Dear Sebastian,

What about HRM?

https://github.com/sapientinc/HRM

https://arxiv.org/abs/2506.21734

Do you have any thoughts on it yet?

TY for all your work! 🙏

Sebastian Raschka, PhD

Thanks for the ping! I have it on my ever-growing list but haven’t had a chance yet.

Sepehr Akbari

Fantastic comparison and highlights! Thank you!

Eric Engle

gpt-oss sucks.

Qwen’s better.

gpt-oss sucks because it hallucinates All The Time.

Sebastian Raschka, PhD

Yeah, I think it’s more meant to be used as a code and logic reasoning model. And for knowledge facts it’s supposed to rely on tool calling (retrieving info from search queries). Or did you find that it hallucinates on coding tasks as well?

Eric Engle

I didn't test it for coding; that's a secondary use case, and Phi4, Codestral, and Codellama definitely perform well, so why bother? I mean, if it screws up text generation, why check the rest of it, especially since Phi4 is great, Codestral is good, and even Codellama is nice (I use all three because one does some things better than the others).

Sebastian Raschka, PhD

I think they probably wanted to achieve good benchmark scores, and a model of limited size has limited capacity. And they seem to have prioritized reasoning. Likely they want to address the knowledge issue with tool calling. I mean, it makes sense because this way the model also doesn’t get outdated so quickly. But the open-source ecosystem and tooling are not quite ready for tool calling.

Eric Engle

We can imagine all sorts of reasons gpt-oss sucks. Who cares? It sucks! It generates extensive long-form content, but too bad it’s chock full of ribald hallucinations. It’s unusable. Maybe the idea is “you can fine-tune it; hit this dog in the ass long enough and we swear it will go fast and be vicious,” but I’m not even gonna try to fine-tune on it.

kevin

Great!

Matt Ma

Amazing work. Thanks!!

StNick

How does training work for MoE? Like why don’t all the experts just end up learning similar weights?

Sebastian Raschka, PhD

Could be an interesting article in itself! The short answer is the same reason that all neurons in a linear layer end up computing different activations: they start out with different random weights. In addition, some models also add extra regularization terms to the loss to encourage more diversity among the experts.
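
(To make the symmetry-breaking idea concrete, here is a minimal, illustrative PyTorch sketch of a top-1 MoE layer, not the actual gpt-oss or Qwen3 implementation. Each expert gets its own random initialization, and an optional auxiliary load-balancing term of the kind some MoE models use nudges the router to spread tokens across experts.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-1 mixture-of-experts layer (illustrative sketch only)."""

    def __init__(self, d_model=64, d_expert=32, num_experts=4):
        super().__init__()
        # Each expert is created with its own random initialization,
        # so the experts compute different functions from the first step on.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)   # (num_tokens, num_experts)
        top1 = probs.argmax(dim=-1)              # chosen expert per token

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])

        # Optional auxiliary loss (Switch-Transformer style): encourages the
        # router to distribute tokens evenly across experts, which further
        # discourages redundant experts.
        load = F.one_hot(top1, len(self.experts)).float().mean(dim=0)
        importance = probs.mean(dim=0)
        aux_loss = len(self.experts) * (load * importance).sum()
        return out, aux_loss


moe = TinyMoE()
y, aux = moe(torch.randn(8, 64))
print(y.shape, aux.item())
```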

StNick

Thanks! The phrase “expert” made me think there is something done to force them to learn specific things, like fine-tuning each one on a subset of the training dataset. I guess that’s not the case, though?

Anshuman

Why are open-weight models so important and causing this buzz? OpenAI didn't release their training procedure, etc.

Sebastian Raschka, PhD

My personal reasons are: (1) I can run this (the 20b) locally on my Mac Mini, which is great when working with sensitive data, and (2) I can use it as a base model for research purposes when tinkering with training algos (although I prefer the smaller Qwen3 models here due to compute costs).

Daniel Kleine

Great summary and thoughtful analysis, including on points like attention sinks and MXFP4!

One thing I’m trying to clarify is this sentence:

"An intermediate expert (feed-forward) projection dimension of 5760 instead of 768."

Could you explain why the MoE expert width is much larger (doubled) in GPT-OSS (e.g., 5,760/2,880), but in Qwen3 the expert dimension listed is comparatively small (e.g., 768) and not doubled?

Sebastian Raschka, PhD

Thanks! And yeah, this is one of the interesting points here. That is, for dense models we usually expand rather than shrink the hidden dimension in the feed-forward layer. However, I'd say for MoEs it's not unusual to do the opposite to reduce parameters. For example, DeepSeek V3 did that too: 4096 -> 1536 -> 4096. I think it's mainly a consequence of using many small vs. a few large experts. For the former you shrink, for the latter you expand.
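
(As a quick illustrative sketch of the "expand vs. shrink" point, using the dimensions mentioned in this thread; these are toy PyTorch blocks, not the actual gpt-oss or DeepSeek V3 expert implementations, which use gated activations and other details.)

```python
import torch.nn as nn

def ffn(d_model, d_hidden):
    """A plain two-layer feed-forward block: d_model -> d_hidden -> d_model."""
    return nn.Sequential(
        nn.Linear(d_model, d_hidden),
        nn.GELU(),
        nn.Linear(d_hidden, d_model),
    )

# "Expand" style: a few large experts widen the hidden dimension,
# e.g. the 2880 -> 5760 -> 2880 expert shape discussed above.
large_expert = ffn(d_model=2880, d_hidden=5760)

# "Shrink" style: many small experts narrow it instead, e.g.
# DeepSeek V3's 4096 -> 1536 -> 4096 or Qwen3's comparatively small
# expert dimension of 768, which keeps the total parameter count
# manageable when there are dozens of experts.
small_expert = ffn(d_model=4096, d_hidden=1536)
```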

Daniel Kleine

I see, thanks!

Benjamin Riley

"This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some 'general knowledge forgetting.'"

Could you elaborate on this point? Why would focused training on reasoning tasks lead to degradation of general capability?

Sebastian Raschka, PhD

Good question. It’s related to (catastrophic) forgetting in neural nets in general. That is, if you have a fixed capacity to learn information, learning certain information will lead to forgetting other information. Or more pragmatically: if you update the weights to improve reasoning, who knows how the weight updates affect information that was learned during pre-training; it might corrupt it a bit.

A good example of this is the “LoRA learns less but forgets less” paper (I covered it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction), which also shows this for general pre-training: training the model more on math makes the model worse at code and vice versa. It’s basically a no-free-lunch issue.
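
(For context on the LoRA reference, here is a rough, illustrative sketch of the idea rather than the paper's exact setup: the pretrained weight matrix is frozen and only a small low-rank update is trained, which is one way to "learn less but forget less.")

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, pretrained: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad_(False)        # pretrained weights stay fixed
        d_out, d_in = pretrained.weight.shape
        # Only these two small matrices are trained during fine-tuning,
        # which limits how far the model can drift from its pretrained state.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```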

Benjamin Riley

Many thanks. It strikes me that this is a pretty significant difference between AI models and human intelligence: as far as we can tell, there's no limit to our long-term memory. Of course, we also don't have to read the entire Internet.

Sebastian Raschka, PhD

I don’t think LLMs work anything like human brains. But that being said, I do think humans suffer from the same issue. If you asked me on the spot to retake any of my college exams without revisiting the material, I’d probably fail.

Andy Andurkar

Great blog, thanks, you are amazing. Regarding Figure 13 ("A gpt-oss and Qwen3 model of comparable size side by side"): the intermediate projection size should be 8 * 768, is that correct?

Vincenzo Agrillo

Was MXFP4 already used in the training path, or was it applied after the last step of training (RL)?

Ram

I appreciate it, but for me it is too technical. Please continue to post.

Gonçalo Perdigão

This is the best article and explanation I have read so far. Clear, insightful, and wonderfully written. The illustrations are especially helpful in making the concepts easy to grasp. I would be very interested to hear your thoughts on hybrid systems, particularly the idea of complementing LLMs and Transformers with tools for handling large-scale numerical computations and models that replicate the physical laws of the world, as in some of Wolfram’s proposals.

P.M.SALMAN KHAN

Amazing article! Being new to this, I found it so intuitive how some x and y are chosen purely to improve efficiency and reduce computation.

Thank you!

hank

Fantastic!
