I frequently reference a process called Reinforcement Learning from Human Feedback (RLHF) when discussing LLMs, whether in research news or tutorials. RLHF is an integral part of the modern LLM training pipeline because it incorporates human preferences into the optimization landscape, which can improve the model's helpfulness and safety.
Very nice summary Sebastian!
FYI, based on correspondence I had with the Llama 2 authors, rejection sampling is done in an “offline” fashion: one first generates K samples per prompt in the dataset, then applies ranking + SFT on the top-ranked samples.
You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
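The offline procedure described in the comment above can be sketched in a few lines. This is a toy illustration, not the Llama 2 implementation: `generate` and `reward` are hypothetical stand-ins for an LLM's sampling function and a trained reward model, and `k` is the number of samples drawn per prompt.

```python
import random


def generate(prompt: str, seed: int) -> str:
    # Toy stand-in for sampling a completion from an LLM.
    rng = random.Random(hash((prompt, seed)) % (2**32))
    return f"{prompt} -> answer_{rng.randint(0, 9)}"


def reward(prompt: str, completion: str) -> float:
    # Toy stand-in for a trained reward model; here it just
    # scores the trailing digit of the completion.
    return float(completion[-1])


def rejection_sample(prompts, k=4):
    """Offline rejection sampling: draw k completions per prompt,
    rank them with the reward model, and keep the highest-scoring
    one. The resulting (prompt, best_completion) pairs form a new
    dataset for a subsequent SFT round."""
    sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt, seed) for seed in range(k)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        sft_data.append((prompt, best))
    return sft_data


sft_dataset = rejection_sample(["What is RLHF?", "Explain SFT."], k=4)
```

Because generation, ranking, and fine-tuning are separate passes over the whole dataset, this differs from "online" RL methods like PPO, where samples are scored and used for updates within the training loop.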