The paper also includes some insights on how to prompt reasoning models:
(1) Zero-shot outperforms few-shot - Their extensive testing revealed that few-shot prompting consistently degrades model performance, contrary to traditional LLM best practices.
(2) Direct problem description wins - The model performs best when users simply state the problem and specify the output format, avoiding complex prompting patterns.
(3) Language consistency matters - Using the same language throughout the prompt is crucial, as the model can mix languages in reasoning chains when prompts contain multiple languages.
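For illustration, here is a minimal sketch of what a prompt following these three recommendations might look like (the wording and the example problem are my own, not taken from the paper):

```python
# Hypothetical zero-shot prompt following the three recommendations above:
# no few-shot examples, a direct problem statement plus an output format,
# and a single language used throughout.
prompt = (
    "Solve the following problem.\n"
    "Problem: A train travels 120 km in 1.5 hours. "
    "What is its average speed in km/h?\n"
    "Output format: a single number followed by the unit."
)
print(prompt)
```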
Just wanted to express my appreciation for this and all your previous posts. I have always found value in them and look forward to the next one.
Thanks, Binit!
It seems like the reasoning models don't know when not to reason. For example, if I ask a reasoning LLM a very factual question (e.g., who is the author of X), it will still go through a thinking process although it is entirely unnecessary. Why can't it "reason" out saying "Okay, this is a factual question, I already know the answer, I don't need to reason"? I would love to know your thoughts on that.
That's a good question. Some reasoning models can actually handle that quite well. I mean, if you type "What is 2+2?" into o1, it won't attempt any reasoning but will just give you the answer. I think it's all a matter of diversity in the training data and preference tuning for refinement. But in any case, a model that can do extensive reasoning with intermediate steps will sometimes also apply that reasoning even when it's not necessary.
Great summary. I have a question on inference-time compute. When you say giving more time to think, I'm wondering what it means physically at the matmul level. If everything at the fundamental level boils down to choosing a tensor with the maximum probability, and the dims correspond to the maximum amount of info that can be held, I always equate it to the ability to process large numbers, hence GPUs. So inference-time compute would equate to throwing more compute power at it. But are you suggesting that inference-time scaling is simply more time to predict the next token, and that "thinking step by step" in the input sequence would do the reasoning trick?
Good question. In this case, none of the commonly used inference-time scaling techniques operate at that low a level, i.e., at the matmul level. Actually, it would be impossible, because then you'd have a mismatch between training and inference in the architecture itself (e.g., you can't increase the weight matrix just during inference; it would have to have been modified during training already, but then that's not inference-time scaling anymore, just general scaling).
So, regarding your question of whether inference-time scaling is simply more time spent predicting next tokens, with "thinking step by step" in the input sequence doing the reasoning trick: the inference scaling comes from the fact that the model just generates more tokens. I.e., if you add the "think step by step" instruction, the model may generate 2x as many tokens, which makes inference 2x more expensive. I hope that answers your question (please let me know if not).
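To make the cost argument concrete, here is a toy back-of-the-envelope sketch (the helper function and the token counts are made up for illustration): each generated token requires roughly one forward pass, so doubling the number of generated tokens roughly doubles the decoding cost.

```python
# Toy cost model: decoding cost scales roughly linearly with the number
# of generated tokens (numbers are purely illustrative).
def decoding_cost(num_generated_tokens: int, cost_per_token: float = 1.0) -> float:
    # One forward pass per generated token.
    return num_generated_tokens * cost_per_token

direct_answer = 50    # tokens for a short, direct answer
cot_answer = 2 * 50   # "think step by step" roughly doubles the output here

print(decoding_cost(direct_answer))  # 50.0
print(decoding_cost(cot_answer))     # 100.0, i.e., 2x the inference cost
```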
It does. Thank you 🙏
Seb, if I can ask one more follow-up: how's test-time scaling (i.e., asking the model to generate more tokens) different from calling the model API twice with the same prompt? Sampling effectively does the same internally, right?
You mention "cold start" with respect to the R1-Zero model at the end of the preliminary "A brief look at the DeepSeek training pipeline" section, stating that
"This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step..."
When I examine the paper, I don't see "cold start" referenced with respect to R1-Zero, but rather with respect to the full R1 base model. And, in particular, it seems to be a response to the interesting but sub-optimal results of just applying RL without SFT in the R1-Zero case.
Am I missing something?
The first reference to "cold start" I see in the paper is made with respect to the R1 base model discussed in Section 2.3.1, where it seems to explicitly refer to a small round of SFT prior to RL. To quote from the first sentence of the second paragraph of that section:
"In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as
the starting point for RL."
This small round of SFT seems to boost the efficacy of the following round of RL, with several additional rounds of SFT / RL then applied afterwards.
Moreover, from their description of the data used for this fine-tuning in the preceding paragraph, it is unclear that this data was raw output from the R1-Zero model, as it is described as:
"To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators."
Paper link https://arxiv.org/pdf/2501.12948
Thanks for the comment! In R1 they use "cold start" data from the R1-Zero model to train V3 to become R1. The fact that they use cold start data from R1-Zero is why I called R1-Zero the "cold start" model. Here, I am thinking of the term "cold start" as "starting" without warming up the model with SFT.
I appreciate your great post! Is there any pure RL approach to improve non-reasoning capabilities, such as anthropomorphic chatting?
Hi there, I believe that would just be regular preference tuning via RLHF (https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives). But everyone does that after SFT; I haven't seen anyone skipping SFT.
Very nice summary! Thanks!
Thank you for distilling us! 👍
haha, you are welcome
In Section 3, regarding "Supervised Fine-tuning and Reinforcement Learning (SFT + RL)," please refer to Section 2.3.3 of the DeepSeek-R1 paper, which states that 800k samples are used for SFT on the DeepSeek-V3-based model for two epochs before applying RL in all scenarios. Therefore, I think it would be good to add one more SFT step before the last RL stage in your graph.
Am I misunderstanding anything?
Thanks for the note. I think I should have had a fresh arrow coming down from the base model there. I updated it.
Thanks for the update! It looks much clearer now.
I'm not sure I understand the "Pure reinforcement learning" part. So just by generating a response to a question, scoring it with a yes/no from some external validation tool, and adjusting the model's weights accordingly, the LLM developed the same behaviour that was previously gained by training on data from humans?
Good points. And the answer is "yes and no". Previously, the behavior was developed via SFT+RL, but the data doesn't need to come from humans; it can be machine-generated in both cases. Another example of these verifiable rewards in the RL stage is the Tulu 3 paper, where they introduced reinforcement learning with verifiable rewards. So the part about the external validation is not necessarily new. But the fact that it's sufficient on its own, and that one can skip SFT (which can be either human- or machine-generated), is new.
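As a rough illustration of what such an external validation tool can look like, here is a minimal sketch of a verifiable reward function for math-style answers (a hypothetical example, not the DeepSeek-R1 or Tulu implementation):

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the model output matches the known
    correct answer, else 0.0. The reward is then used to update the policy
    during RL, with no human-labeled data required for these prompts."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

print(verifiable_reward("Step 1: 12 * 3 = 36. Step 2: 36 + 6 = 42. Answer: 42", "42"))  # 1.0
print(verifiable_reward("The answer is 41", "42"))                                      # 0.0
```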
And SFT, in terms of preparing a "reasoning" version of a model, is just training on examples of answers that split the problem into smaller steps and solve them one by one?
So without SFT, that would mean the model developed this ability all by itself? That's impressive. Or could it be that the "base" model already had some examples of that in its training set, and RL just amplified them during fine-tuning?
"According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost."
-> I think there is a small mistake in the text: According to the evaluation table (https://novasky-ai.github.io/posts/sky-t1/), they compared it with the smaller model "o1-preview", or am I wrong?
Oh, I thought o1-preview was considered better/on par with o1, but it's been some time. Just did a quick search: https://community.openai.com/t/performance-o1-vs-o1-preview/1046831
Yeah, I was just mentioning that because these are different models, even though they perform similarly, depending on the task. I have also found this benchmark table:
https://docsbot.ai/models/compare/o1/o1-preview#benchmarks
There is a mistake, as R1 is not derived from R1-Zero.
Thanks for the comment, the figures should reflect that. Or is there any place in the text where this is wrong? Thanks for letting me know!
It is now correct in the first 3 flowcharts, but in the final flowchart in Section 4 titled, "The development process of DeepSeek-R1-Distill models.", it still shows R1 as from R1-zero.
Also, what are your thoughts on using more sophisticated mechanisms to do the reasoning / explore latent spaces?
When I worked in supply chain risk analysis, we had the supply chains mapped as directed graphs. Using Bayesian belief models (not BNNs), we were able to run simulations with incomplete information: what happens when I disrupt one part of the supply chain but don't have data on all the other parts?
When I was reading the Coconut paper, my thought was that it seemed like they'd turned the latent space into a graph. This should make more sophisticated decoding techniques possible (such as using Bayesian inference to calculate more contextual next tokens).
Just a random thought I'm throwing out there. Would love to hear your opinion on whether you think this is useful.
Excellent work, Seb. Would you be interested in guest-posting this on my newsletter? You wouldn't have to do too much more; just copy this article over and I'll write a quick intro.
You've been one of my favorite writers in the ML space for years, and you make the content of complicated (and complicatedly written) papers so much more approachable.
Again, thank you very much for explaining another topic in a comprehensible way!