7 Comments

Very nice summary, Sebastian!

FYI, from correspondence I had with the Llama 2 authors: rejection sampling is done in an “offline” fashion, where one first generates K samples per prompt in the dataset and then applies ranking + SFT.

You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
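In case it helps to make that offline step concrete, here is a minimal sketch (not the authors' actual pipeline; the model name, the `reward` helper, and the prompt list are placeholders):

```python
# Sketch of "offline" rejection sampling: draw K candidates per prompt from the
# current policy, keep the highest-reward one, then run ordinary SFT on the result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

K = 4  # number of samples drawn per prompt

tokenizer = AutoTokenizer.from_pretrained("my-org/policy-model")      # placeholder name
policy = AutoModelForCausalLM.from_pretrained("my-org/policy-model")  # placeholder name

def reward(prompt: str, response: str) -> float:
    # Placeholder: in the real setup this is a trained reward model
    # (e.g., a sequence classifier scoring prompt + response).
    return 0.0

prompts = ["Explain rejection sampling in one sentence."]  # stand-in for the prompt set

sft_dataset = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = policy.generate(
            **inputs,
            do_sample=True,          # sample K diverse candidates
            max_new_tokens=256,
            num_return_sequences=K,
        )
    # keep only the generated continuation (drop the prompt tokens)
    continuations = outputs[:, inputs["input_ids"].shape[1]:]
    candidates = tokenizer.batch_decode(continuations, skip_special_tokens=True)

    # "ranking" step: keep the best sample per prompt according to the reward model
    best = max(candidates, key=lambda c: reward(prompt, c))
    sft_dataset.append({"prompt": prompt, "response": best})

# Finally, fine-tune on `sft_dataset` with the usual supervised (SFT) objective.
```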


Thanks for sharing these additional insights from your convo with the authors!

Btw, I also thought they would do it more iteratively (since they show 1 to 100 samples in Figure 7, and their dataset must be much larger than that). They also write: "In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, ..."

So, if I understand the authors' response to your question correctly, "iteratively" strictly means only one iteration (collection & update) per model version, i.e., RLHF-v1 -> RLHF-v2 -> RLHF-v3 -> RLHF-v4?
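If that reading is correct, the overall loop would be something like the following rough sketch (all names are placeholders, only meant to make the structure explicit):

```python
# One collection-and-update iteration per model version (assuming that reading).
# Placeholder functions only; `rejection_sample` is the offline step sketched above.

def rejection_sample(policy, prompts, reward_model, k):
    """Sample k outputs per prompt from the current policy and keep the best one."""
    return []  # placeholder

def sft_finetune(policy, dataset):
    """Supervised fine-tuning on the kept (prompt, response) pairs."""
    return policy  # placeholder

policy = "sft-checkpoint"               # placeholder for the initial SFT model
prompts, reward_model, K = [], None, 4  # placeholders

for version in ["RLHF-v1", "RLHF-v2", "RLHF-v3", "RLHF-v4"]:
    best_samples = rejection_sample(policy, prompts, reward_model, k=K)  # collect
    policy = sft_finetune(policy, best_samples)                          # update
# (per the paper, PPO is additionally applied on top for the final version)
```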


While searching for rejection sampling, I came across this article. Very nice summary! Thanks! 😊


Glad to hear that this comes in handy!


Awesome! This is the kind of article I've been searching for extensively over the past week to understand the lifecycle of training an LLM from start to end. Loved your other articles as well. Please keep them coming!


I've been thinking a lot recently about the developments in RLHF and the community's opinion on them. I've seen some people saying it's worthless or useless (specifically on X/Twitter and in some white papers) now that the DPO alignment method has shown great results. I appreciate the deep analysis of the differences between methods; it's clear that we have many different ways to achieve helpfulness and safety (among other properties). The Constitutional AI method for alignment seems promising, as it's a more interpretable and easier-to-realign method IMO. At the same time, I don't see many of these alignment methods for helpfulness and safety addressing the varying cultural perspectives on these properties.


I honestly wouldn't write off RLHF just yet. Sure, DPO shows promise, but there's no controlled study comparing the two approaches side by side. The Zephyr model, for example, is trained only with DPO as far as I know; i.e., there is no direct comparison with the same architecture trained on the same dataset using RLHF. Also, benchmark performance is only one piece of the puzzle. As the Meta researchers noted, RLHF makes LLMs safer (since that's what some of the reward models were trained on) but may hurt accuracy, which is not necessarily reflected in average benchmark performance. What we would need is a thorough user preference study with respect to both safety and helpfulness.
