You mention Cold Start with respect to the R1-Zero model at the end of the preliminary "A brief look at the DeepSeek training pipeline" section, and mention that
"This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step..."
When I examine the paper, I don't see "cold start" used with respect to R1-Zero, but rather with respect to the full R1 model. In particular, it seems to be a response to the interesting but sub-optimal results of applying RL without SFT in the R1-Zero case.
Am I missing something?
The first reference to "cold start" I see in the paper is made with respect to the R1 model discussed in Section 2.3.1, where it seems to explicitly refer to a small round of SFT prior to RL. To quote the first sentence of the second paragraph of that section:
"In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL."
This small round of SFT seems to boost the efficacy of the following round of RL, with several additional rounds of SFT / RL then applied afterwards.
Moreover, from their description of the data used for this fine-tuning in the preceding paragraph, it is unclear that this data was raw output from the R1-Zero model, as it is described as:
"To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators."
Thanks for the comment! In R1 they use "cold start" data from the R1-Zero model to train V3 to become R1. The fact that they use cold start data from R1-Zero is why I called R1-Zero the "cold start" model. Here, I am thinking of the term "cold start" as "starting" without warming up the model with SFT.
There is a mistake: R1 is not derived from R1-Zero.
Thanks for the comment, the figures should reflect that. Or is there a place in the text where this is wrong? Thanks for letting me know!
Paper link https://arxiv.org/pdf/2501.12948