Very nice summary, Sebastian!

FYI, from correspondence I had with the Llama 2 authors: rejection sampling is done in an “offline” fashion, where one first generates K samples for every prompt in the dataset, and only then applies ranking + SFT.

You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
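
In case it helps, here is a rough sketch of how I picture the offline procedure described above. It is not the authors' code: `generate` and `reward` are hypothetical stand-ins for the policy and the reward model, and keeping only the single top-ranked sample per prompt is my assumption about what "ranking" selects. The SFT step itself is left out; the function only builds the dataset it would train on.

```python
from typing import Callable, List, Tuple


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],       # hypothetical: samples one completion for a prompt
    reward: Callable[[str, str], float],  # hypothetical: scores a (prompt, completion) pair
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Offline rejection sampling sketch: sample K per prompt, then rank."""
    # Step 1 (offline): generate all K candidates for every prompt up front,
    # before any ranking or training happens.
    candidates = [[generate(p) for _ in range(k)] for p in prompts]

    # Step 2: rank each prompt's candidates by reward and keep the best one
    # (assumption: top-1 selection).
    dataset = []
    for prompt, samples in zip(prompts, candidates):
        best = max(samples, key=lambda s: reward(prompt, s))
        dataset.append((prompt, best))

    # Step 3 (not shown): run SFT on the returned (prompt, completion) pairs.
    return dataset
```

The point of the "offline" part is that step 1 finishes for the whole dataset before steps 2 and 3 start, rather than sampling and updating the model in an interleaved loop.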

