
The paper also includes some insights on how to prompt reasoning models:

(1) Zero-shot outperforms few-shot - Their extensive testing revealed that few-shot prompting consistently degrades model performance, contrary to traditional LLM best practices.

(2) Direct problem description wins - The model performs best when users simply state the problem and specify the output format, avoiding complex prompting patterns.

(3) Language consistency matters - Using the same language throughout the prompt is crucial, as the model can mix languages in reasoning chains when prompts contain multiple languages.
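For illustration, a direct zero-shot prompt along those lines might look something like this (a minimal sketch; the client setup, base URL, and model name are my assumptions, not from the paper):

```python
from openai import OpenAI

# Hypothetical setup -- adjust base_url, api_key, and model for your provider.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

# (1) no few-shot examples, (2) state the problem directly and specify the
# output format, (3) keep the whole prompt in a single language.
prompt = "Solve for x: 3x + 7 = 22. Return only the final answer as an integer."

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```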


Just wanted to express my appreciation for this and all your previous posts. I have always found value in them and look forward to the next one.


Thanks, Binit!


It seems like the reasoning models don't know when not to reason. For example, if I ask a reasoning LLM a purely factual question (e.g., "Who is the author of X?"), it will still go through a thinking process even though it is entirely unnecessary. Why can't it "reason" its way to saying "Okay, this is a factual question, I already know the answer, I don't need to reason"? I would love to know your thoughts on that.


That's a good question. Some reasoning models can actually handle that reasonably well. I mean, if you type "What is 2+2?" into o1, it won't attempt any reasoning there but will just give you the answer. I think it's all a matter of diversity in the training data and preference tuning for refinement. But in any case, a model that can do extensive reasoning with intermediate steps will sometimes accidentally apply that even when it's not necessary.


Great summary. I have a question on inference-time compute. When you say "giving the model more time to think," I'm wondering what that means physically at the matmul layer. If everything at the fundamental level boils down to choosing a tensor with maximum probability, and the dims correspond to the maximum amount of info that can be held, I always equate this to the ability to process large amounts of numbers, hence GPUs. So inference-time compute would equate to throwing more compute power at the problem. But are you suggesting that inference-time scaling is simply more time spent predicting next tokens, and that "thinking step by step" in the input sequence would do the reasoning trick?


Good question. In this case, none of the commonly used inference-time scaling techniques operate at that low a level, i.e., at the matmul level. Actually, that would be impossible, because you'd have a mismatch between training and inference in the architecture itself (e.g., you can't enlarge a weight matrix just during inference; it would have to have been modified during training already, but then that's not inference-time scaling anymore, just general scaling).

So, regarding your last question: the inference-time scaling comes from the fact that the model simply generates more tokens. I.e., if you add "think step by step," the model may generate 2x as many tokens, which makes the inference 2x more expensive. I hope that answers your question (please let me know if not).
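To make the cost aspect concrete, here is a tiny back-of-the-envelope sketch (all numbers are made up purely for illustration):

```python
# Made-up numbers purely for illustration.
cost_per_1k_output_tokens = 0.002  # assumed price, not a real quote

direct_answer_tokens = 150    # plain question -> short answer
step_by_step_tokens = 300     # same question + "think step by step"

print("direct answer: $", direct_answer_tokens / 1000 * cost_per_1k_output_tokens)
print("step by step:  $", step_by_step_tokens / 1000 * cost_per_1k_output_tokens)
# Twice as many generated tokens -> roughly twice the inference cost,
# with the model weights and architecture left completely unchanged.
```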


It does. Thank you 🙏


Seb, if I can ask one more follow-up: how is test-time scaling (i.e., asking the model to generate more tokens) different from calling the model API twice with the same prompt? Sampling effectively does the same thing internally, right?


You mention "cold start" with respect to the R1-Zero model at the end of the preliminary "A brief look at the DeepSeek training pipeline" section, writing that

"This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step..."

When I examine the paper, I don't see "cold start" referenced with respect to R1-Zero, but rather with respect to the full R1 model. And, in particular, it seems to be a response to the interesting but sub-optimal results of just applying RL without SFT in the R1-Zero case.

Am I missing something?

The first reference to "cold start" I see in the paper is made with respect to the R1 model discussed in Section 2.3.1, where it seems to explicitly refer to a small round of SFT prior to RL. To quote from the first sentence of the second paragraph of that section:

"In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as

the starting point for RL."

This small round of SFT seems to boost the efficacy of the following round of RL, with several additional rounds of SFT / RL then applied afterwards.

Moreover, from their description of the data used for this fine-tuning in the preceding paragraph, it is unclear that this data was raw output from the R1-Zero model, as it is described as:

"To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators."

Paper link https://arxiv.org/pdf/2501.12948


Thanks for the comment! For R1, they use "cold start" data from the R1-Zero model to train V3 into R1. The fact that they use cold-start data from R1-Zero is why I called R1-Zero the "cold start" model. Here, I am thinking of the term "cold start" as "starting" without warming up the model via SFT.
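Roughly, this is the order of steps I have in mind (a toy sketch with placeholder stubs, just to express the sequence as code; the real training stages are of course far more involved):

```python
# Toy stubs so the pipeline order can be written down as runnable code.
def sft(model, data):
    return model + [f"SFT({data})"]

def rl(model, reward):
    return model + [f"RL({reward})"]

v3_base = ["DeepSeek-V3-Base"]

# R1-Zero: pure RL on the base model, with no SFT warm-up.
r1_zero = rl(v3_base, "verifiable rewards")

# R1: warm up V3-Base with SFT on cold-start data (partly gathered from
# R1-Zero), then RL, another larger SFT round, and a final RL stage.
r1 = sft(v3_base, "cold-start data")
r1 = rl(r1, "verifiable rewards")
r1 = sft(r1, "~800k samples")
r1 = rl(r1, "preference rewards")

print(" -> ".join(r1))
```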


I appreciate your great post! Is there any pure RL approach to improve non-reasoning capabilities, such as anthropomorphic (human-like) chatting?


Hi there, I believe that would just be regular preference tuning via RLHF (https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives). But everyone does that after SFT; I haven't seen anyone skipping SFT.


Very nice summary! Thanks!


Thank you for distilling us! 👍


haha, you are welcome


In Section 3, regarding "Supervised Fine-tuning and Reinforcement Learning (SFT + RL)," refer to Section 2.3.3 of the DeepSeek-R1 paper, which states that 800k samples are used for SFT on the DeepSeek-V3-Base model for two epochs before applying RL in all scenarios. Therefore, I think it would be good to add one more SFT step before the last RL stage in your graph.

Am I misunderstanding anything?


Thanks for the note. I think I should have had a fresh arrow coming down from the base model there. I updated it.


Thanks for the update! It looks much clearer now.


I'm not sure I understand the "pure reinforcement learning" part. So by just generating a response to a question, scoring it via a yes/no from some external validation tool, and modifying the model's weights accordingly, the LLM developed the same behavior that was previously gained by training on data from humans?


Good points. And the answer is "yes and no." Previously, the behavior was developed by SFT+RL, but the data doesn't need to come from humans; it can be machine-generated in both cases. Another example of these verifiable rewards in the RL stage was the Tulu 3 paper, where they introduced reinforcement learning with verifiable rewards. So the part about the external validation is not necessarily new. But the fact that it's sufficient, and that one can skip SFT (which can be either human- or machine-generated), is new.
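For a rough idea of what such a verifiable reward can look like in code, here is a minimal sketch (the rule-based checks in the actual papers are more elaborate than this):

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Toy rule-based reward: 1.0 if the last number in the output matches
    the known-correct answer, else 0.0. No learned reward model involved."""
    numbers = re.findall(r"-?\d+\.?\d*", model_output)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

# During the RL stage, each generated response would be scored like this:
print(verifiable_reward("Step 1: 3x = 15, so x = 5", "5"))  # 1.0
print(verifiable_reward("I think x = 4", "5"))              # 0.0
```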


And SFT, in terms of preparing the "reasoning" version of a model, is just training on some examples of answers that split the problem into smaller steps and solve it one by one?
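For example, I imagine a single training sample could look roughly like this (a made-up example, not from the actual dataset):

```python
# Hypothetical reasoning-style SFT sample with an explicit step-by-step answer.
sample = {
    "prompt": "A train travels 120 km in 2 hours. What is its average speed?",
    "response": (
        "<think>\n"
        "Step 1: Average speed = distance / time.\n"
        "Step 2: 120 km / 2 h = 60 km/h.\n"
        "</think>\n"
        "The average speed is 60 km/h."
    ),
}
```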

So without SFT, that would mean the model developed this ability all by itself? That's impressive. Or could it be that the "base" model already had some examples of that in its training set, and RL just amplified them during fine-tuning?


"According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost."

-> I think there is a small mistake in the text: according to the evaluation table (https://novasky-ai.github.io/posts/sky-t1/), they compared it with the earlier "o1-preview" model, or am I wrong?


Oh, I thought o1-preview was considered better/on par with o1, but it's been some time. Just did a quick search: https://community.openai.com/t/performance-o1-vs-o1-preview/1046831


Yeah, I was just mentioning that because these are different models, even though they perform similarly, depending on the task. I have also found this benchmark table:

https://docsbot.ai/models/compare/o1/o1-preview#benchmarks


There is a mistake, as R1 is not derived from R1-Zero.


Thanks for the comment, the figures should reflect that. Or is there any place in the text where this is wrong? Thanks for letting me know!


It is now correct in the first three flowcharts, but in the final flowchart in Section 4, titled "The development process of DeepSeek-R1-Distill models," it still shows R1 as being derived from R1-Zero.


Also, what are your thoughts on using more sophisticated mechanisms to do the reasoning / explore latent spaces?

When I worked in supply chain risk analysis, we had the supply chains mapped as directed graphs. Using Bayesian belief models (not BNNs), we were able to run simulations with incomplete information - what happens when I disrupt one part of the supply chain but don't have data on all the other parts.

When I was reading the Coconut paper, my thought was that it seemed like they'd turned the latent space into a graph. This should make more sophisticated decoding techniques possible (such as using some Bayesian inference to calculate more contextual next tokens).

Just a random thought I'm throwing out there. I would love to hear your opinion on whether you think this is useful.


Excellent work, Seb. Would you be interested in guest posting this on my newsletter? You wouldn't have to do much more - just copy this article over, and I'll write a quick intro.


You've been one of my favorite writers in the ML space for years, and you make the content of complicated (and complicatedly written) papers so much more approachable.

Again, thank you very much for explaining another topic in a comprehensible way!
