5 Comments
Sep 11, 2023 · Liked by Sebastian Raschka, PhD

Very nice summary, Sebastian!

FYI from correspondence I had with the Llama 2 authors, rejection sampling is done in an “offline” fashion where one first generates K samples per prompt in the dataset, then applies ranking + SFT.

You can read more about this here: https://huggingface.co/papers/2307.09288#64c6961115bd12e5798b9e3f
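For anyone curious, a rough sketch of what that offline loop could look like — generate K candidates per prompt, rank them with a reward model, keep the best one, then run a standard SFT pass on the selected pairs. The `generate_fn` / `score_fn` callables below are hypothetical placeholders, not the actual Llama 2 or Hugging Face code:

```python
# Minimal sketch of "offline" rejection sampling as described above:
# generate K samples per prompt, rank with a reward model, then SFT on the best.
# generate_fn(prompt) -> response and score_fn(prompt, response) -> float
# are assumed stand-ins supplied by the caller, not a real library API.

def select_best_samples(prompts, generate_fn, score_fn, k=4):
    """Return one (prompt, best_response) pair per prompt for a later SFT pass."""
    selected = []
    for prompt in prompts:
        # 1) Generate K candidate responses for this prompt (offline, over the dataset)
        candidates = [generate_fn(prompt) for _ in range(k)]
        # 2) Rank the candidates by reward-model score and keep the top one
        best = max(candidates, key=lambda c: score_fn(prompt, c))
        selected.append((prompt, best))
    # 3) The resulting dataset is then used for ordinary supervised fine-tuning
    return selected
```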

Feb 27 · Liked by Sebastian Raschka, PhD

Awesome! This was the kind of article that I've been extensively searching for over the past week to understand the lifecycle of training an LLM from start to end. Loved your other articles as well. Please keep them coming!

Oct 13, 2023 · Liked by Sebastian Raschka, PhD

I've been thinking a lot recently about developments in RLHF and the community's opinion of them. I saw some people saying it's worthless or useless (specifically on X/Twitter and in some white papers) now that the DPO alignment method has shown great results. I appreciate the deep analysis of the differences between methods; it's clear that we have many different ways to achieve helpfulness and safety (among other properties). The Constitutional AI method for alignment seems promising, as it's a more interpretable and easier-to-realign approach IMO. At the same time, I don't see many of these alignment methods for helpfulness and safety addressing the varying cultural perspectives on these properties.
