43 Comments
Dr. Ashish Bamania

Love this article! Quite helpful as always!

Zupo Llask

Dear Sebastian,

What about HRM?

https://github.com/sapientinc/HRM

https://arxiv.org/abs/2506.21734

Do you already have a say about it?

TY for all your work! 🙏

Sebastian Raschka, PhD

Thanks for the ping! I have it on my ever-growing list but haven’t had a chance yet.

Vincenzo Agrillo

Was MXFP4 already used during training, or applied only after the last training step (RL)?

Sebastian Raschka, PhD

Good question. They are unfortunately a bit sparse with the training details. According to the model card paper, they say

> We post-trained the models with quantization of the MoE weights to MXFP4 format

But it's unclear whether that refers to the entire post-training or an additional round of post-training after the original one.
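For readers curious what MXFP4 actually means, here is a toy NumPy simulation of the format's two ingredients: a shared power-of-two scale per block plus 4-bit E2M1 elements. This is only an illustration of the idea, not OpenAI's actual quantization code, and the real MX spec applies the scale to 32-element blocks:

```python
import numpy as np

# Representable magnitudes of the 4-bit E2M1 element format used by MXFP4
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block):
    """Simulate MXFP4 on one block of weights: pick a shared power-of-two
    scale, then round each element to the nearest representable FP4 value."""
    max_abs = np.abs(block).max()
    if max_abs == 0:
        return np.zeros_like(block)
    # Shared scale: smallest power of two so the block max fits in the grid
    scale = 2.0 ** np.ceil(np.log2(max_abs / FP4_GRID[-1]))
    scaled = block / scale
    # Round-to-nearest on the FP4 magnitude grid, keeping the sign
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

weights = np.array([0.11, -0.52, 0.98, 0.07])
print(mxfp4_quantize(weights))
```

Because only the 4-bit codes and one small scale per block are stored, this is what makes the MoE weights compact enough for a single GPU.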

Gonçalo Perdigão

This is the best article and explanation I have read so far. Clear, insightful, and wonderfully written. The illustrations are especially helpful in making the concepts easy to grasp. I would be very interested to hear your thoughts on hybrid systems, particularly the idea of complementing LLMs and Transformers with tools for handling large-scale numerical computations and models that replicate the physical laws of the world, as in some of Wolfram’s proposals

Sebastian Raschka, PhD

Thanks! I must admit that I am not familiar with Wolfram's approach. But I remember that early versions of AlphaFold (which is also based on the transformer architecture) incorporated physical constraints, and that this was later abandoned. I.e., instead of hardcoding physical laws, they let the models learn those implicitly from the training data.

I think hard-coding things is tricky. Maybe a good workaround is to encourage more tool calling by the LLM, which can use calculators, code, and in the future perhaps specific simulators. Actually, gpt-oss is meant to be used with tool calling, but the open-source tools running gpt-oss don't support it out of the box (yet).
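As a rough sketch of the tool-calling pattern described here (all names, like `demo_model` and the `calculator` tool, are made up for illustration; this is not the actual gpt-oss harness):

```python
# Minimal tool-calling loop: the model either answers or requests a tool;
# the harness runs the tool and feeds the result back into the context.
def run_with_tools(model_step, tools, prompt, max_turns=5):
    context = [prompt]
    for _ in range(max_turns):
        action = model_step(context)  # e.g. {"tool": "calculator", "input": "2 + 2"}
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](action["input"])
        context.append(f"tool result: {result}")
    return None

def demo_model(context):
    # Stub standing in for the LLM: first requests the calculator, then answers
    if not any(c.startswith("tool result") for c in context):
        return {"tool": "calculator", "input": "2 + 2"}
    return {"answer": context[-1].removeprefix("tool result: ")}

tools = {"calculator": lambda expr: str(eval(expr))}  # toy calculator via eval
print(run_with_tools(demo_model, tools, "What is 2 + 2?"))  # prints 4
```

The point is that factual or numerical work is delegated to the tool, so the model only has to decide when to call it.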

Daniel Gutierrez

Unbelievable and fantastic analysis, so very helpful. Thank you!

Anthralytic

Very helpful. Thanks!

Rod Furlan

Small correction: the RTX 4090 does support MXFP4 and can run the 20B model without issues.

Sebastian Raschka, PhD

Thanks for the note. Looks like it was a transformers library issue, and they merged a PR to fix it: https://github.com/huggingface/transformers/pull/39940

I'll update this in the article; thanks for pointing it out.

Alexei R

A Tesla T4 costs $700, and it runs the 20B model. Two A6000 48GB cards run the 120B model.

ZOO hoozoo

It's such an amazing analysis and so helpful! Great job.

Sebastian Raschka, PhD

Thanks!

Xiaorui Wang, PhD

Hello Sebastian,

Awesome material; it really makes for exciting weekend reading. For the oss model, the dimension per head would be 45 (2880/64), which is not a typical choice for LLMs. Could you share some insights on this selection? Looking forward to hearing from you.

Sebastian Raschka, PhD

I thought the exact same thing! Very unusual. But then, GPT-2 XL had 25 attention heads, which I thought was a bit weird back then, too!
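For reference, the arithmetic behind the question (numbers as quoted in the comment above, not taken from an official config file):

```python
# Head dimension implied by dividing the embedding width across the heads
emb_dim, n_heads = 2880, 64
head_dim = emb_dim // n_heads
print(head_dim)  # 45, an odd, non-power-of-two head size
```

Most LLMs pick head sizes like 64 or 128, which is what makes 45 stand out.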

Andy Andurkar

Great blog, thanks, you are amazing. Regarding Figure 13 (a gpt-oss and Qwen3 model of comparable size side by side): shouldn't the intermediate projection size be 8 * 768? Is that correct?

Sebastian Raschka, PhD

Thanks for the kind words and feedback! What the figure is showing is the projection for a single expert, hence 768. But yes, if you consider all 8 active experts, you have 8 such projections.
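A quick sketch of that accounting, using the per-expert width of 768 and the 8 active experts from the exchange above (illustrative numbers from this thread, not an official config):

```python
# Per-expert intermediate projection width vs. the combined width
# across all experts active for a given token
intermediate_per_expert = 768  # what the figure shows for a single expert
active_experts = 8             # experts routed per token
combined_active_width = active_experts * intermediate_per_expert
print(combined_active_width)   # 6144
```

So both readings are consistent: the figure reports the per-expert width, while the per-token compute involves 8 such projections.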

Sepehr Akbari

Fantastic comparison and highlights! Thank you!

Eric Engle

gpt-oss sucks.

qwen’s better.

gpt-oss sucks because it hallucinates All The Time.

Sebastian Raschka, PhD

Yeah, I think it's more meant to be used as a code and logic reasoning model. And for knowledge facts, it's supposed to use tool calling (retrieving info from search queries). Or did you find that it hallucinates on coding tasks as well?

Eric Engle

I didn't test it for coding; that's a secondary use case, and Phi4, Codestral, and CodeLlama definitely perform well, so why bother? I mean, if it screws up text generation, why check the rest of it, especially since Phi4 is great, Codestral is good, and even CodeLlama is nice (I use all three because each does some things better than the others).

Sebastian Raschka, PhD

I think they probably wanted to achieve good benchmark scores, and a model of limited size has limited capacity. They seem to have prioritized reasoning, and they likely want to address the knowledge issue with tool calling. I mean, it makes sense, because this way the model also doesn't get outdated so quickly. But the open-source ecosystem and tooling is not quite ready for tool calling.

Eric Engle

We can imagine all sorts of reasons gpt-oss sucks. Who cares? It sucks! It generates extensive long-form content, but too bad it's chock full of ribald hallucinations. It's unusable. Maybe the idea is "you can fine-tune it; hit this dog in the ass long enough and we swear it will go fast and be vicious," but I'm not even gonna try to fine-tune it.

Ram

I appreciate it, but for me it is too technical. Please continue to post.

P.M.SALMAN KHAN

Amazing article! Being new to this, I found it so intuitive how certain parameters are chosen purely to improve efficiency and reduce computation.

Thank you !

kevin

great!

Matt Ma

Amazing work. Thanks!!
