30 Comments
Dr. Ashish Bamania

Love this article! Quite helpful as always!

Zupo Llask

Dear Sebastian,

What about HRM?

https://github.com/sapientinc/HRM

https://arxiv.org/abs/2506.21734

Do you have any thoughts on it yet?

TY for all your work! 🙏

Sebastian Raschka, PhD

Thanks for the ping! I have it on my ever-growing list but haven’t had a chance yet.

Sepehr Akbari

Fantastic comparison and highlights! Thank you!

Eric Engle

gpt-oss sucks.

Qwen’s better.

gpt-oss sucks because it hallucinates All The Time.

Sebastian Raschka, PhD

Yeah, I think it’s more meant to be used as a code and logic reasoning model. And for knowledge facts it’s supposed to rely on tool calling (retrieving info from search queries). Or did you find that it hallucinates on coding tasks as well?

Eric Engle

I didn't test it for coding; that's a secondary use case, and Phi4, Codestral, and Codellama definitely perform well, so why bother? I mean, if it screws up text generation, why check the rest of it, especially since Phi4 is great, Codestral is good, and even Codellama is nice (I use all three because one does some things better than the others).

Sebastian Raschka, PhD

I think they probably wanted to achieve good benchmark scores, and a model of limited size has limited capacity. And they seem to have prioritized reasoning. Likely they want to address the knowledge issue with tool calling. I mean, it makes sense because this way the model also doesn’t get outdated so quickly. But the open-source ecosystem and tooling are not quite ready for tool calling.

Eric Engle

We can imagine all sorts of reasons gpt-oss sucks. Who cares? It sucks! It generates extensive long-form content, but too bad it’s chock full of ribald hallucinations. It’s unusable. Maybe the idea is “you can fine-tune it; hit this dog in the ass long enough and we swear it will go fast and be vicious,” but I’m not even gonna try to fine-tune on it.

kevin

Great!

Matt Ma

Amazing work. Thanks!!

StNick

How does training work for MoE? Like why don’t all the experts just end up learning similar weights?

Sebastian Raschka, PhD

Could be an interesting article in itself! The short answer is the same reason that all neurons in a linear layer end up computing different activations: they start out with different random weights. In addition, some models also add extra regularization terms to the loss to encourage more diversity among the experts.
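
(To make the symmetry-breaking idea concrete, here is a minimal, illustrative PyTorch sketch of a top-1 MoE layer, not the actual gpt-oss or Qwen3 implementation. Each expert gets its own random initialization, and an optional auxiliary load-balancing term of the kind some MoE models use nudges the router to spread tokens across experts.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-1 mixture-of-experts layer (illustrative sketch only)."""

    def __init__(self, d_model=64, d_expert=32, num_experts=4):
        super().__init__()
        # Each expert is created with its own random initialization,
        # so the experts compute different functions from the first step on.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)   # (num_tokens, num_experts)
        top1 = probs.argmax(dim=-1)              # chosen expert per token

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])

        # Optional auxiliary loss (Switch-Transformer style): encourages the
        # router to distribute tokens evenly across experts, which further
        # discourages redundant experts.
        load = F.one_hot(top1, len(self.experts)).float().mean(dim=0)
        importance = probs.mean(dim=0)
        aux_loss = len(self.experts) * (load * importance).sum()
        return out, aux_loss


moe = TinyMoE()
y, aux = moe(torch.randn(8, 64))
print(y.shape, aux.item())
```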

StNick

Thanks! The phrase “expert” made me think there is something done to force them to learn specific things, like fine-tuning each one on a subset of the training dataset. I guess that’s not the case, though?

Anshuman

Why are open-weight models so important and causing this buzz? OpenAI didn't release their training procedure, etc.

Sebastian Raschka, PhD

My personal reasons are: (1) I can run this (the 20b) locally on my Mac Mini, which is great when working with sensitive data, and (2) I can use it as a base model for research purposes when tinkering with training algos (although I prefer the smaller Qwen3 models here due to compute costs).

Daniel Kleine

Great summary and thoughtful analysis, including on points like attention sinks and MXFP4!

One thing I’m trying to clarify is this sentence:

"An intermediate expert (feed-forward) projection dimension of 5760 instead of 768."

Could you explain why the MoE expert width is much larger (doubled) in GPT-OSS (e.g., 5,760/2,880), but in Qwen3 the expert dimension listed is comparatively small (e.g., 768) and not doubled?

Sebastian Raschka, PhD

Thanks! And yeah, this is one of the interesting points here. That is, for dense models we usually expand rather than shrink the hidden dimension in the feed-forward layer. However, I'd say for MoEs it's not unusual to do the opposite to reduce parameters. For example, DeepSeek V3 did that too: 4096 -> 1536 -> 4096. I think it's mainly a consequence of using many small vs. a few large experts. For the former you shrink, for the latter you expand.
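
(As a quick illustrative sketch of the "expand vs. shrink" point, using the dimensions mentioned in this thread; these are toy PyTorch blocks, not the actual gpt-oss or DeepSeek V3 expert implementations, which use gated activations and other details.)

```python
import torch.nn as nn

def ffn(d_model, d_hidden):
    """A plain two-layer feed-forward block: d_model -> d_hidden -> d_model."""
    return nn.Sequential(
        nn.Linear(d_model, d_hidden),
        nn.GELU(),
        nn.Linear(d_hidden, d_model),
    )

# "Expand" style: a few large experts widen the hidden dimension,
# e.g. the 2880 -> 5760 -> 2880 expert shape discussed above.
large_expert = ffn(d_model=2880, d_hidden=5760)

# "Shrink" style: many small experts narrow it instead, e.g.
# DeepSeek V3's 4096 -> 1536 -> 4096 or Qwen3's comparatively small
# expert dimension of 768, which keeps the total parameter count
# manageable when there are dozens of experts.
small_expert = ffn(d_model=4096, d_hidden=1536)
```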

Daniel Kleine

I see, thanks!

Benjamin Riley

"This may stem from its heavy training focus on reasoning tasks such as math, puzzles, and code, which could have led to some 'general knowledge forgetting.'"

Could you elaborate on this point? Why would focused training on reasoning tasks lead to degradation of general capability?

Sebastian Raschka, PhD

Good question. It’s related to (catastrophic) forgetting in neural nets in general. That is, if you have a fixed capacity to learn information, learning certain information will lead to forgetting other information. Or more pragmatically: if you update the weights to improve reasoning, who knows how the weight updates affect information that was learned during pre-training; it might corrupt it a bit.

A good example of this is the “LoRA learns less but forgets less” paper (I covered it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction), which also shows this for general pre-training: training the model more on math makes the model worse at code and vice versa. It’s basically a no-free-lunch issue.
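
(For context on the LoRA reference, here is a rough, illustrative sketch of the idea rather than the paper's exact setup: the pretrained weight matrix is frozen and only a small low-rank update is trained, which is one way to "learn less but forget less.")

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, pretrained: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad_(False)        # pretrained weights stay fixed
        d_out, d_in = pretrained.weight.shape
        # Only these two small matrices are trained during fine-tuning,
        # which limits how far the model can drift from its pretrained state.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```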

Benjamin Riley

Many thanks. It strikes me that this is a pretty significant difference between AI models and human intelligence: as far as we can tell, there's no limit to our long-term memory. Of course, we also don't have to read the entire Internet.

Sebastian Raschka, PhD

I don’t think LLMs work anything like human brains. But that being said, I do think humans suffer from the same issue. If you asked me on the spot to retake any of my college exams without revisiting the material, I’d probably fail.

Andy Andurkar

Great blog, thanks, you are amazing. Regarding Figure 13 ("A gpt-oss and Qwen3 model of comparable size side by side"): the intermediate projection size should be 8 * 768, is that correct?

Vincenzo Agrillo

Was MXFP4 already used in the training path, or was it applied after the last step of training (RL)?

Ram

I appreciate it, but for me it is too technical. Please continue to post.

Gonçalo Perdigão

This is the best article and explanation I have read so far. Clear, insightful, and wonderfully written. The illustrations are especially helpful in making the concepts easy to grasp. I would be very interested to hear your thoughts on hybrid systems, particularly the idea of complementing LLMs and Transformers with tools for handling large-scale numerical computations and models that replicate the physical laws of the world, as in some of Wolfram’s proposals.

P.M.SALMAN KHAN

Amazing article! Being new to this, I found it so intuitive how some x and y are chosen purely to improve efficiency and reduce computation.

Thank you!

hank

Fantastic!
