I frequently reference a process called Reinforcement Learning with Human Feedback (RLHF) when discussing LLMs, whether in research news coverage or in tutorials. RLHF is an integral part of the modern LLM training pipeline because it incorporates human preferences into the optimization landscape, which can improve the model's helpfulness and safety.
Very nice summary Sebastian!
FYI from correspondence I had with the Llama 2 authors, rejection sampling is done in an “offline” fashion where one first generates K samples per prompt in the dataset, then applies ranking + SFT.
You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
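For readers curious about the mechanics, here is a minimal sketch of the "offline" rejection-sampling loop described above: draw K candidate responses per prompt, rank them with a reward model, and keep only the top-ranked response for a subsequent SFT pass. The `generate` and `reward` functions are hypothetical stand-ins, not any real API.

```python
def rejection_sample(prompts, generate, reward, k=4):
    """Offline rejection sampling sketch: return (prompt, best_response)
    pairs to use as an SFT dataset.

    generate(prompt) -> str        # samples one candidate response
    reward(prompt, response) -> float  # reward-model score for ranking
    """
    sft_pairs = []
    for prompt in prompts:
        # Step 1: generate K candidate responses for this prompt.
        candidates = [generate(prompt) for _ in range(k)]
        # Step 2: rank candidates by reward and keep the best one.
        best = max(candidates, key=lambda r: reward(prompt, r))
        sft_pairs.append((prompt, best))
    # Step 3 (not shown): fine-tune the model on sft_pairs via SFT.
    return sft_pairs
```

Because sampling happens over the whole dataset before any ranking or fine-tuning, the procedure is "offline" in the sense the Llama 2 authors describe, as opposed to interleaving generation and policy updates.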
Awesome! This was the kind of article that I've been extensively searching for this past week to understand the lifecycle of training an LLM from start to end. Loved your other articles as well. Please keep them coming!
I was thinking a lot recently about the developments in RLHF and the community opinion on it. I saw some people saying it's worthless or useless (specifically on X/Twitter and in some white papers) now that the DPO alignment method has shown strong results. I appreciate the deep analysis of the differences between methods; it's clear that we have many different ways to achieve helpfulness and safety (among other properties). The Constitutional AI method for alignment seems promising, as it's a more interpretable and easier-to-realign method IMO. At the same time, I don't see many of these alignment methods for helpfulness and safety addressing the varying cultural perspectives on these properties.