Love this article! Quite helpful as always!
Dear Sebastian,
What about HRM?
https://github.com/sapientinc/HRM
https://arxiv.org/abs/2506.21734
Do you already have a say about it?
TY for all your work! 🙏
Thanks for the ping! I have it on my ever-growing list but haven’t had a chance yet.
Was MXFP4 already used in the training path, or only after the last step of training (RL)?
Good question. They are unfortunately a bit sparse with the training details. According to the model card paper, they say
> We post-trained the models with quantization of the MoE weights to MXFP4 format
But it's unclear if that's the entire post-training, or a round of post-training after the original post-training.
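For readers unfamiliar with the format: MXFP4 stores weights as 4-bit floats (E2M1), where each block of values shares a single power-of-two scale. Here is a minimal, illustrative sketch of that idea; the real OCP MX spec uses blocks of 32 elements and handles rounding and edge cases this toy version ignores:

```python
import math

# Positive values representable in E2M1, the 4-bit float element of MXFP4
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of weights with a shared power-of-two scale."""
    amax = max(abs(v) for v in block) or 1.0
    # Pick a power-of-two scale so the largest magnitude fits under 6.0,
    # the largest representable E2M1 value
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for v in block:
        # round each scaled magnitude to the nearest FP4 grid point
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(math.copysign(mag * scale, v))
    return quantized

print(quantize_block([0.7, -1.3, 0.02, 2.5]))
```

The shared block scale is why MXFP4 gets away with only 4 bits per element: the scale absorbs the dynamic range, and the 4-bit code only has to cover roughly one and a half octaves within each block.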
This is the best article and explanation I have read so far. Clear, insightful, and wonderfully written. The illustrations are especially helpful in making the concepts easy to grasp. I would be very interested to hear your thoughts on hybrid systems, particularly the idea of complementing LLMs and Transformers with tools for handling large-scale numerical computations and models that replicate the physical laws of the world, as in some of Wolfram’s proposals
Thanks! I must admit that I am not familiar with Wolfram's approach. But I remember that early versions of AlphaFold (which is also based on the transformer architecture) incorporated physical constraints but later abandoned them. I.e., instead of hard-coding physical laws, they let the models learn those implicitly from the training data.
I think hard-coding things is tricky. Maybe a good workaround is to encourage more tool calling by the LLM, which can use calculators, code, and in the future perhaps specific simulators. Actually, gpt-oss is meant to be used with tool calling, but the open-source tools running gpt-oss don't support it out of the box (yet).
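The tool-calling loop described here can be sketched roughly as follows; `model_step` is a hypothetical stub standing in for the actual LLM, not a real gpt-oss API:

```python
def calculator(expression: str) -> str:
    # toy calculator tool; a real harness would sandbox this properly
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def model_step(messages):
    # Stub standing in for the LLM: it requests the calculator once,
    # then answers with whatever the tool returned.
    if messages[-1]["role"] == "user":
        return {"role": "assistant",
                "tool_call": {"name": "calculator",
                              "arguments": {"expression": "23 * 19"}}}
    return {"role": "assistant",
            "content": f"The result is {messages[-1]['content']}."}

messages = [{"role": "user", "content": "What is 23 * 19?"}]
step = model_step(messages)
while "tool_call" in step:
    call = step["tool_call"]
    result = TOOLS[call["name"]](**call["arguments"])   # harness runs the tool
    messages.append({"role": "tool", "content": result})  # result fed back
    step = model_step(messages)

print(step["content"])  # The result is 437.
```

The point of the loop is that the model never does the arithmetic itself; it only decides *when* to call a tool and how to phrase the answer, which is exactly the division of labor being discussed above.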
Unbelievable and fantastic analysis, so very helpful. Thank you!
Very helpful. Thanks!
Small correction: the RTX 4090 does support MXFP4 and can run the 20B model without issues
Thanks for the note. Looks like it was a transformers library issue, and they merged a PR to fix it: https://github.com/huggingface/transformers/pull/39940
Will update this in the article, thanks for pointing it out
A Tesla T4 costs $700, and it runs the 20B model. Two 48 GB A6000s run the 120B model.
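As a rough sanity check of those hardware claims, here is a back-of-envelope estimate of the weight memory alone (it ignores the KV cache and activations, uses approximate parameter counts, and treats all weights as MXFP4 at ~4.25 bits/parameter including block scales, which slightly overstates the savings since only the MoE weights are quantized):

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given bit width."""
    return n_params * bits_per_param / 8 / 1e9

# ~21B and ~117B are approximate parameter counts for the two gpt-oss models
print(round(weight_gb(21e9, 4.25), 1))   # ~11.2 GB -> fits a 16 GB T4
print(round(weight_gb(117e9, 4.25), 1))  # ~62.2 GB -> fits two 48 GB A6000s
```

So both claims are plausible on weight memory alone, with headroom left for the KV cache at moderate context lengths.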
Such an amazing analysis, and so helpful! Great job.
Thanks!
Hello Sebastian,
Awesome material. It makes for exciting weekend reading. For the oss model, the dimension per head would be 45 (2880/64), which is not a typical choice for LLMs. Could you share some insights on this selection? Looking forward to hearing from you.
I thought the exact same thing! Very unusual. But then, GPT-2 XL had 25 attention heads, which I thought was a bit weird back then, too!
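For what it's worth, the arithmetic behind both numbers, assuming the attention projection width equals the embedding width (some models deliberately break that assumption, so treat the 2880/64 figures as the comment's premise rather than a verified config value):

```python
def head_dim(proj_width: int, n_heads: int) -> float:
    """Per-head dimension = attention projection width / number of heads."""
    return proj_width / n_heads

print(head_dim(2880, 64))  # 45.0 -- the unusual value questioned above
print(head_dim(1600, 25))  # 64.0 -- GPT-2 XL: 25 heads, but a standard 64-dim head
```

Note the difference: GPT-2 XL's 25 heads still yield the conventional head dimension of 64, whereas 2880/64 would not, which is why separate query/key/value projection widths are a common way out.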
Great blog, thanks, you are amazing. In Figure 13 (A gpt-oss and Qwen3 model of comparable size side by side), shouldn't the intermediate projection size be 8 * 768?
Thanks for the kind words and feedback! What this is showing is the size for a single expert, hence 768. But yeah, if you consider all 8 active experts, you have 8 such projections.
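To make the single-expert vs. all-active-experts distinction concrete, here is a quick parameter count under the sizes discussed (2880 hidden, 768 intermediate per expert, 8 active experts; these are the figure's numbers, not verified against the actual config):

```python
# Each expert owns an up-projection (hidden -> intermediate) and a
# down-projection (intermediate -> hidden); the router combines the
# outputs of the 8 active experts.
hidden, inter, n_active = 2880, 768, 8

per_expert = hidden * inter + inter * hidden   # up- plus down-projection weights
total_active = n_active * per_expert           # all 8 active experts together

print(per_expert)     # 4423680 weights in a single expert's projections
print(total_active)   # 35389440 weights across the 8 active experts
```

So the 768 in the figure is the width of one expert's intermediate layer, while the token's effective feed-forward computation spans 8 such projections, as noted in the reply above.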
Fantastic comparison and highlights! Thank you!
gpt-oss sucks.
qwen’s better.
gpt-oss sucks because it hallucinates All The Time.
Yeah, I think it’s more meant to be used as a code and logic reasoning model. And for knowledge facts it’s supposed to use tool calling (retrieving info from search queries). Or did you find that it hallucinates on coding tasks as well?
i didn't test it for coding; that's a secondary use case, and Phi4, Codestral, and Codellama definitely perform well, so why bother? I mean, if it screws up text generation, why check the rest of it, especially since Phi4 is great, Codestral good, and even Codellama is nice (I use all three because one does some things better than the others).
I think they probably wanted to achieve good benchmark scores, and a model of limited size has limited capacity. And they seem to have prioritized reasoning. Likely they want to address the knowledge issue with tool calling. I mean, it makes sense, because this way the model also doesn’t get outdated so quickly. But the open-source ecosystem and tooling is not quite ready for tool calling.
we can imagine all sorts of reasons gpt-oss sucks. who cares? it sucks! it generates extensive long form content — too bad it’s chock full of ribald hallucinations. It’s unusable. Maybe the idea is “you can fine tune it, hit this dog in the ass long enough and we swear it will go fast and be vicious” but i’m not even gonna try to fine tune on it.
I appreciate it but for me it is too technical. Please continue to post.
Amazing article! Being new to this, I found it so intuitive how some values of x and y are chosen purely to improve efficiency and reduce computation.
Thank you!
great!
Amazing work. Thanks!!