The paper also includes some insights on how to prompt reasoning models:
(1) Zero-shot outperforms few-shot - Their extensive testing revealed that few-shot prompting consistently degrades model performance, contrary to traditional LLM best practices.
(2) Direct problem description wins - The model performs best when users simply state the problem and specify the output format, avoiding complex prompting patterns.
(3) Language consistency matters - Using the same language throughout the prompt is crucial, as the model can mix languages in reasoning chains when prompts contain multiple languages.
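For illustration, here is a minimal sketch of what a prompt following these three recommendations might look like (the wording and the example problem are my own, not taken from the paper):

```python
# Hypothetical zero-shot prompt following the three recommendations above:
# no few-shot examples, a direct problem statement plus an output format,
# and a single language used throughout.
prompt = (
    "Solve the following problem.\n"
    "Problem: A train travels 120 km in 1.5 hours. "
    "What is its average speed in km/h?\n"
    "Output format: a single number followed by the unit."
)
print(prompt)
```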
Just wanted to express my appreciation for this and all your previous posts. I have always found value in them and look forward to the next one.
Thanks, Binit!
It seems like the reasoning models don't know when not to reason. For example, if I ask a reasoning LLM a very factual question (e.g., who is the author of X), it will still go through a thinking process although it is entirely unnecessary. Why can't it "reason" out saying "Okay, this is a factual question, I already know the answer, I don't need to reason"? I would love to know your thoughts on that.
That's a good question. Some reasoning models can actually handle that quite well. I mean, if you type "What is 2+2?" into o1, it won't attempt any reasoning but will just give you the answer. I think it's all a matter of diversity in the training data and preference tuning for refinement. But in any case, a model that can do extensive reasoning with intermediate steps will sometimes also apply that reasoning even when it's not necessary.
Great summary. I have a question on inference-time compute. When you say giving more time to think, I'm wondering what it means physically at the matmul level. If everything at the fundamental level boils down to choosing a tensor with the maximum probability, and the dims correspond to the maximum amount of info that can be held, I always equate it to the ability to process large numbers, hence GPUs. So inference-time compute would equate to throwing more compute power at it. But are you suggesting that inference-time scaling is simply more time to predict the next token, and that "thinking step by step" in the input sequence would do the reasoning trick?
Good question. In this case, none of the commonly used inference-time scaling techniques operate at that low a level, i.e., at the matmul level. Actually, it would be impossible, because then you'd have a mismatch between training and inference in the architecture itself (e.g., you can't increase the weight matrix just during inference; it would have to have been modified during training already, but then that's not inference-time scaling anymore, just general scaling).
So, regarding your question of whether inference-time scaling is simply more time spent predicting next tokens, with "thinking step by step" in the input sequence doing the reasoning trick: the inference scaling comes from the fact that the model just generates more tokens. I.e., if you add the "think step by step" instruction, the model may generate 2x as many tokens, which makes inference 2x more expensive. I hope that answers your question (please let me know if not).
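To make the cost argument concrete, here is a toy back-of-the-envelope sketch (the helper function and the token counts are made up for illustration): each generated token requires roughly one forward pass, so doubling the number of generated tokens roughly doubles the decoding cost.

```python
# Toy cost model: decoding cost scales roughly linearly with the number
# of generated tokens (numbers are purely illustrative).
def decoding_cost(num_generated_tokens: int, cost_per_token: float = 1.0) -> float:
    # One forward pass per generated token.
    return num_generated_tokens * cost_per_token

direct_answer = 50    # tokens for a short, direct answer
cot_answer = 2 * 50   # "think step by step" roughly doubles the output here

print(decoding_cost(direct_answer))  # 50.0
print(decoding_cost(cot_answer))     # 100.0, i.e., 2x the inference cost
```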
It does. Thank you 🙏
Seb, if I can ask one more follow-up: how's test-time scaling (i.e., asking the model to generate more tokens) different from calling the model API twice with the same prompt? Sampling effectively does the same internally, right?
You mention "cold start" with respect to the R1-Zero model at the end of the preliminary "A brief look at the DeepSeek training pipeline" section, stating that
"This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step..."
When I examine the paper, I don't see "cold start" referenced with respect to R1-Zero, but rather with respect to the full R1 base model. And, in particular, it seems to be a response to the interesting but sub-optimal results of just applying RL without SFT in the R1-Zero case.
Am I missing something?
The first reference to "cold start" I see in the paper is made with respect to the R1 base model discussed in Section 2.3.1, where it seems to explicitly refer to a small round of SFT prior to RL. To quote from the first sentence of the second paragraph of that section:
"In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL."
This small round of SFT seems to boost the efficacy of the following round of RL, with several additional rounds of SFT / RL then applied afterwards.
Moreover, from their description of the data used for this fine-tuning in the preceding paragraph, it is unclear that this data was raw output from the R1-Zero model, as it is described as:
"To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators."
Paper link https://arxiv.org/pdf/2501.12948
Thanks for the comment! In R1 they use "cold start" data from the R1-Zero model to train V3 to become R1. The fact that they use cold start data from R1-Zero is why I called R1-Zero the "cold start" model. Here, I am thinking of the term "cold start" as "starting" without warming up the model with SFT.
I appreciate your great post! Is there any pure RL approach to improve non-reasoning capabilities, such as anthropomorphic chatting?
Hi there, I believe that would just be regular preference tuning via RLHF (https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives). But everyone does that after SFT; I haven't seen anyone skipping SFT.
Very nice summary! Thanks!
Thank you for distilling us! 👍
haha, you are welcome
In Section 3, regarding "Supervised Fine-tuning and Reinforcement Learning (SFT + RL)," please refer to Section 2.3.3 of the DeepSeek-R1 paper, which states that 800k samples are used for SFT on the DeepSeek-V3-based model for two epochs before applying RL in all scenarios. Therefore, I think it would be good to add one more SFT step before the last RL stage in your graph.
Am I misunderstanding anything?
Thanks for the note. I think I should have had a fresh arrow coming down from the base model there. I updated it.
Thanks for the update! It looks much clearer now.
I'm not sure I understand the "Pure reinforcement learning" part. So just by generating a response to a question, scoring it with a yes/no from some external validation tool, and adjusting the model's weights accordingly, the LLM developed the same behaviour that was previously gained by training on data from humans?
Good points. And the answer is "yes and no". Previously, the behavior was developed via SFT+RL, but the data doesn't need to come from humans; it can be machine-generated in both cases. Another example of these verifiable rewards in the RL stage is the Tulu 3 paper, where they introduced reinforcement learning with verifiable rewards. So the part about the external validation is not necessarily new. But the fact that it's sufficient on its own, and that one can skip SFT (which can be either human- or machine-generated), is new.
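As a rough illustration of what such an external validation tool can look like, here is a minimal sketch of a verifiable reward function for math-style answers (a hypothetical example, not the DeepSeek-R1 or Tulu implementation):

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the model output matches the known
    correct answer, else 0.0. The reward is then used to update the policy
    during RL, with no human-labeled data required for these prompts."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verifiable_reward("Step 1: 12 * 3 = 36. Step 2: 36 + 6 = 42. Answer: 42", "42"))  # 1.0
print(verifiable_reward("The answer is 41", "42"))                                      # 0.0
```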
And SFT, in terms of preparing a "reasoning" version of a model, is just training on examples of answers that split the problem into smaller steps and solve them one by one?
So without SFT, that would mean the model developed this ability all by itself? That's impressive. Or could it be that the "base" model already had some examples of that in its training set, and RL just amplified them during fine-tuning?
"According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost."
-> I think there is a small mistake in the text: According to the evaluation table (https://novasky-ai.github.io/posts/sky-t1/), they compared it with the smaller model "o1-preview", or am I wrong?
Oh, I thought o1-preview was considered better/on par with o1, but it's been some time. Just did a quick search: https://community.openai.com/t/performance-o1-vs-o1-preview/1046831
Yeah, I was just mentioning that because these are different models, even though they perform similarly, depending on the task. I have also found this benchmark table:
https://docsbot.ai/models/compare/o1/o1-preview#benchmarks
There is a mistake, as R1 is not derived from R1-Zero.
Thanks for the comment, the figures should reflect that. Or is there any place in the text where this is wrong? Thanks for letting me know!
It is now correct in the first 3 flowcharts, but in the final flowchart in Section 4 titled, "The development process of DeepSeek-R1-Distill models.", it still shows R1 as from R1-zero.
Also, what are your thoughts on using more sophisticated mechanisms to do the reasoning / explore latent spaces?
When I worked in supply chain risk analysis, we had the supply chains mapped as directed graphs. Using Bayesian belief models (not BNNs), we were able to run simulations with incomplete information: what happens when I disrupt one part of the supply chain but don't have data on all the other parts?
When I was reading the Coconut paper, my thought was that it seemed like they'd turned the latent space into a graph. This should make more sophisticated decoding techniques possible (such as using Bayesian inference to calculate more contextual next tokens).
Just a random thought I'm throwing out there. Would love to hear your opinion on whether you think this is useful.
Excellent work, Seb. Would you be interested in guest-posting this on my newsletter? You wouldn't have to do too much more; just copy this article over and I'll write a quick intro.
You've been one of my favorite writers in the ML space for years, and you make the content of complicated (and complicatedly written) papers so much more approachable.
Again, thank you very much for explaining another topic in a comprehensible way!