I frequently reference a process called Reinforcement Learning with Human Feedback (RLHF) when discussing LLMs, whether in the research news or tutorials.
Very nice summary Sebastian!
FYI from correspondence I had with the Llama 2 authors, rejection sampling is done in an “offline” fashion where one first generates K samples per prompt in the dataset, then applies ranking + SFT.
You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
Thanks for sharing these additional insights from your convo with the authors!
Btw I also thought that they would do it more iteratively (since they show 1 to 100 samples in Figure 7, and their dataset must be much bigger). They also said "In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, ..."
So, if I understand the authors' response to your question correctly, "iteratively" means strictly one iteration (collection & update) per model version, i.e., RLHF-v1 -> RLHF-v2 -> RLHF-v3 -> RLHF-v4?
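For readers following this exchange, here is a minimal Python sketch of the offline procedure as described in the comments above: generate K samples per prompt, rank them with a reward model, keep the top-ranked response, and fine-tune on the result once per model version. This is only an illustration under stated assumptions, not the Llama 2 authors' code. The names `rejection_sampling_round`, `run_versions`, `sample`, `reward`, and `fine_tune` are hypothetical, and the toy sampler and reward function stand in for a real LLM and a real reward model.

```python
import random
from typing import Any, Callable, Dict, List


def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    k: int,
) -> List[Dict[str, str]]:
    """Collect one offline dataset: draw k samples per prompt and keep
    the one the reward function ranks highest."""
    dataset = []
    for prompt in prompts:
        # All k candidates come from the same (fixed) policy.
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        dataset.append({"prompt": prompt, "response": best})
    return dataset


def run_versions(
    policy: Any,
    prompts: List[str],
    sample: Callable[[Any, str], str],
    reward: Callable[[str, str], float],
    fine_tune: Callable[[Any, List[Dict[str, str]]], Any],
    k: int,
    num_versions: int,
) -> Any:
    """One collection step plus one SFT-style update per model version."""
    for _ in range(num_versions):
        data = rejection_sampling_round(
            prompts, lambda p: sample(policy, p), reward, k
        )
        # Hypothetical update step; a real setup would run supervised fine-tuning.
        policy = fine_tune(policy, data)
    return policy


# Toy stand-ins (assumptions for illustration only).
_rng = random.Random(0)


def toy_sample(policy: int, prompt: str) -> str:
    # "Policy" is just an integer; responses are random-length filler text.
    return prompt + " ->" + " word" * _rng.randint(1, 10)


def toy_reward(prompt: str, response: str) -> float:
    # Pretend the reward model prefers longer responses.
    return float(len(response))


def toy_fine_tune(policy: int, data: List[Dict[str, str]]) -> int:
    # Pretend each update produces the next model version.
    return policy + 1


final_policy = run_versions(
    policy=0,
    prompts=["Explain RLHF", "What is SFT?"],
    sample=toy_sample,
    reward=toy_reward,
    fine_tune=toy_fine_tune,
    k=4,
    num_versions=3,
)
print("final toy policy version:", final_policy)
```

The point of the sketch is the ordering: within a version, the whole dataset is collected from a fixed policy before any update happens, which is what "offline" refers to in the comment above.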
Awesome! This is exactly the kind of article I've been searching for over the past week to understand the lifecycle of training an LLM from start to finish. Loved your other articles as well. Please keep them coming!
I've been thinking a lot recently about the developments in RLHF and the community's opinion on it. I saw some people saying it's worthless or useless (specifically on X/Twitter and in some white papers) now that the DPO alignment method has shown great results. I appreciate the deep analysis of the differences between methods; it's clear that we have many different ways to achieve helpfulness and safety (among other properties). The Constitutional AI method for alignment seems promising, as it's a more interpretable and easy-to-realign method IMO. At the same time, I don't see many of these alignment methods for helpfulness and safety speaking to the varying cultural perspectives on these properties.
I honestly wouldn't write off RLHF just yet. Sure, DPO shows promise, but there's no controlled study comparing both approaches side by side. The Zephyr model, for example, is just trained with DPO as far as I know. That is, there is no direct comparison with the same architecture trained on the same dataset using RLHF. Also, benchmark performance is only one piece of the puzzle. As Meta researchers noted, RLHF makes LLMs safer (since that's what some of the reward models were trained on) but may make accuracy worse, which is not necessarily reflected in the average benchmark performance. What we would need is a thorough user preference study with respect to both safety and helpfulness.