30 Comments
Xavier

Great stuff, as always. The inline cards linking to previous articles are very handy. 🙏 👍

Sebastian Raschka, PhD

Thanks!

Zia Khan

Wow. Great article! You saved me hours of reading. Thank you!

mikolysz

How do you deal with finding and sorting through all the papers that keep coming out in this field? I think it's the aspect I struggle with most, and you clearly are doing a very good job here, so hopefully you have some suggestions.

What tools do you use to do this, especially for new papers? How do you easily distinguish papers worth reading from garbage when there are no citations to speak of? How many abstracts do you read compared to full papers? Do you usually skim-read / LLM summarize to get the gist, or do you actually spend the time to go through everything and attempt to thoroughly understand every single equation presented?

Great post btw, as always.

Sebastian Raschka, PhD

I must say, it’s a big challenge to find those diamonds in the rough. I used to be one of the machine learning moderators at arXiv a few years ago; there were three of us taking turns skimming the titles of all submissions in the ML category (cs.LG), and sometimes the abstracts, too. My job was mainly to check that the articles were categorized correctly (submitters choose the categories, and we additionally had classifiers for flagging miscategorized ones, but those weren't perfect). Long story short, I built a habit of skimming arXiv titles (not always, but pretty often). I also bookmark things I stumble upon on social media. I go purely by title first to keep a short list of interesting papers. Then I select a sub-list and read the abstracts. From that, I skim (quickly read through) the ones that seem interesting for a given project. And finally, I read 1-3 papers a week more carefully. But yeah, ultimately it's a lot of work.
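(In case anyone wants to automate the title-skimming step, here is a rough sketch using the third-party arxiv Python package; the keyword filter is hypothetical and purely illustrative, not my actual tooling.)

```python
# Rough sketch: pull recent cs.LG submissions and flag titles by keyword.
# Requires the third-party `arxiv` package (pip install arxiv).
import arxiv

# Hypothetical interests -- adjust to your own.
KEYWORDS = ["reinforcement", "reasoning", "instruction"]

search = arxiv.Search(
    query="cat:cs.LG",
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

client = arxiv.Client()
for result in client.results(search):
    # Keep only titles that match at least one keyword.
    if any(kw.lower() in result.title.lower() for kw in KEYWORDS):
        print(f"{result.published:%Y-%m-%d}  {result.title}\n  {result.entry_id}")
```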

Ilona Brinkmeier

As always, a perfect article of yours!

Moein Salimi

Great review!

I’m a bit confused—has reinforcement learning actually caused any emergent abilities yet? It may require deeper investigation.

Sebastian Raschka, PhD

According to the DeepSeek-R1 paper, it has; i.e., the "aha moment" that models exhibited after multiple rounds of RL. There are also papers saying that base models already have these emergent abilities. In my opinion, it's not conclusive, though, because nowadays chain-of-thought data is increasingly part of the pre-training data mix.

kevin

Great!

Nimish Sanghi

This is super cool. The explanation of RLHF with PPO takes the cake. Also a super curated list of papers with short explanations.

Szymon Palucha

Great article! Very clear explanations; it's great to have these summaries and not have to spend hours going through the papers.

Sebastian Raschka, PhD

Thanks!

Jassim Moideen

This is a brilliant read. You never fail to impress!

Mario

What a great post!!! Thank you so much. I loved it and it helped me understand a lot of things.

Just a small detail: I think there’s a typo in "the just-realized o3 model". I assume you meant "just-released".

Thanks again for your work!

Sebastian Raschka, PhD

Thanks! And you are totally right, I meant to write "just-released" not "just-realized". This was "just-fixed" :D

nicola leonardi

Regarding this sentence: "And since distillation, in this paper, meant instruction fine-tuning on chain-of-thought data, it's likely that pre-training on data that includes chain-of-thought data induces these abilities as well". I think this is completely analogous to the concept of continued pre-training, where I try to adapt the model to my context with the same next-token prediction approach. I consider it "obvious", in the sense that it does not surprise me at all.

Sebastian Raschka, PhD

Yes, the training task in pre-training and SFT is exactly the next-token prediction task with the same cross-entropy loss.
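As a small illustration, here is what that shared objective looks like in PyTorch (a minimal sketch with dummy tensors standing in for model outputs; real pipelines differ mainly in the data, and SFT implementations often mask the prompt tokens out of the loss):

```python
# Minimal sketch: the next-token prediction objective shared by
# pre-training and SFT. Random logits stand in for a model's output.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)         # model output: (batch, seq, vocab)
tokens = torch.randint(0, vocab_size, (1, seq_len))  # input token ids

# Shift by one position: predict token t+1 from tokens up to t.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# The same cross-entropy loss is used in pre-training and SFT.
loss = F.cross_entropy(pred, target)
print(loss.item())
```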

Btw, another example is that many base models can now follow instructions quite well to some extent (whereas, back in the day, base models were terrible at that). This is mostly because pre-training data now also contains Q&A data. Some of it is coincidental, but it's also often done deliberately now, as I described a while back in

"New LLM Pre-training and Post-training Paradigms" (https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training)

Miguel Conner

I loved the combo of the RL summary and the selected papers. I definitely want to check out the one about the logic puzzles and the one extending RL to other domains; it seems like there’s still a lot to be discovered in these spaces.

Devansh

Sir, you're one of the greatest minds in AI. Another excellent compilation. Your work is genuinely inspiring.

Sebastian Raschka, PhD

Thanks, Devansh!

Dr. Ashish Bamania

Very detailed and interesting. Thanks for it!

Alessandro Pessoa

Hello Raschka, thank you for the articles, always excellent content.

Sanket Gupta

Quite a detailed overview, great work!
