I love how you can distill complex information into a digestible format for people who don't have a strong background in all the technicalities of LLMs, so thank you, Sebastian.
Thanks so much for the kind words!
RewardBench lead author here, a couple notes:
* We're working on your training correlation caveat :)
* Now we're at the phase where we're getting closed models added to the benchmark to show the gap that open models need to close (because good alignment capabilities are important for good societal outcomes).
* The leaderboard reflects a design space that hasn't been well explored yet, so more DPO models exist simply because they're popular. I don't expect this to change too much, but more people have already started training RMs since the release! (A specific training blog post is here: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0?pvs=21)
Great summary of the benchmark. Keep up the great work.
https://huggingface.co/spaces/allenai/reward-bench
Thanks for the informative comment, and I am looking forward to that correlation analysis (haha, but no rush!)
Well written. Very informative and very clear. Thank you!
Glad to hear this!
Hey Sebastian,
Great article as always. I hate to be that guy, but I figured I'd bring it to your attention. Just two minor edits for you: in the "Common 7B Language Models Already Possess Strong Math Capabilities" section, you spelled instruction-finetuning wrong, and in the "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect" section there is a ")" with no matching "(" to go with it.
Whoa, that's a good eye for detail. I appreciate that you gave it such a thorough read! And luckily these were two quick & easy fixes. It should be updated now!
Thanks for the nice newsletter.
1. Why is it that this linear warmup and cosine decay schedule avoids catastrophic forgetting?
a) I suppose the linear warm-up helps to ensure that the model doesn't get jolted away from the starting point (by allowing for a smoother build-up of the gradients). It strikes me that just starting with the final optimizer states might be sufficient and a better approach.
b) I don't really see how the cosine decay helps avoid catastrophic forgetting. It seems to me that it just helps to hone the optimization as the optimal point gets closer and smaller tweaks are needed.
2. I assume you've looked at ORPO. I found DPO could be hard to get working robustly, but ORPO worked reliably as an alternative to SFT for me - you can see a short video on a comparison I did between SFT and ORPO.
> 1. Why is it that this linear warmup and cosine decay schedule avoids catastrophic forgetting?
Avoiding catastrophic forgetting actually comes from the replay part, that is, adding a small fraction of the original data to the new training mix. The linear warmup and cosine decay are there to stabilize the training and help with convergence.
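Roughly like this, as a minimal sketch (the step counts, learning rates, and ~5% replay fraction below are placeholder values for illustration, not the exact numbers from the paper):

```python
import math
import random

def lr_at_step(step, peak_lr=1e-4, min_lr=1e-5,
               warmup_steps=100, total_steps=10_000):
    # Linear warmup from 0 to peak_lr, then a half-cosine decay down to min_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def build_training_mix(new_data, original_data, replay_fraction=0.05):
    # Replay: sprinkle a small fraction of the original pretraining data into
    # the continued-pretraining mix; this is what counters catastrophic forgetting.
    n_replay = min(len(original_data), int(replay_fraction * len(new_data)))
    mix = new_data + random.sample(original_data, n_replay)
    random.shuffle(mix)
    return mix
```

So the schedule shapes the optimization, and the replayed original data is what keeps the model from drifting away from its old capabilities.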
> 2. I assume you've looked at ORPO.
I actually haven't used ORPO (you mean ORPO: Monolithic Preference Optimization without Reference Model, right?). I agree that DPO can be tricky in practice. It's sometimes hard to find good settings and to neither under- nor overtrain it. I somehow missed ORPO and will read up on it. Thanks for the recommendation!
Yeah that's the one on ORPO!
For a lot of these types of papers (unfortunately including both DoRA and DPO), I run the new method versus SFT as the baseline and get worse (or no better) performance. But ORPO seemed to consistently do a bit better (certainly not worse) than pure SFT. You can see MMLU compared for SFT-only vs ORPO-only for the TinyLlama and Mistral base models here: https://youtu.be/OWMJ0rBUj04?si=vri_nKp6ZcOWYzoe&t=162 - each on one epoch of ~7000 rows of preference data.
And thanks for the clarification that it's the training mix that avoids catastrophic forgetting. I must look up some good base datasets I can use to sprinkle into fine-tunes.
In section 2.2, this part has to be the other way around, no?
“that is, models with exactly the same architecture trained on exactly the same dataset but using DPO instead of RLHF with a dedicated reward model”
The linear-warmup-plus-half-cosine schedule is nearly the same as one-cycle training by Leslie Smith, except that the latter uses linear-linear ramps.
Does the intuition remain the same as with one-cycle training?
The two are vastly different in the time to the highest learning rate:
In one-cycle training, the highest LR is reached halfway through.
In this LR schedule, the highest LR is reached very early.
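To make concrete what I mean about where the peak falls, here is a toy sketch (the ~2% warmup fraction is just my guess for illustration, not a value from the paper):

```python
import math

def warmup_half_cosine(step, total_steps, warmup_frac=0.02):
    # Peak LR is reached after a short warmup (here ~2% of training),
    # then a half-cosine decay covers the remaining ~98% of steps.
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

def one_cycle_linear(step, total_steps):
    # Leslie Smith's one-cycle with linear ramps: LR rises to the peak at the
    # halfway point, then falls linearly back down.
    half = total_steps / 2
    return step / half if step <= half else (total_steps - step) / half

total = 1_000
print(max(range(total), key=lambda s: warmup_half_cosine(s, total)))  # peak near step ~20
print(max(range(total), key=lambda s: one_cycle_linear(s, total)))    # peak at step 500
```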
Suggestion: Go back to your previous format ...
Research Papers in <Month Year>.
For example, "Research Papers in March 2024: Tips for LLM Pretraining and Evaluating Reward Models". This can help identify your monthly recaps from your longer topical pieces. ALL are excellent; I'm convinced that it's impossible for you to write something that isn't. But it does help a reader better identify what's in play for a specific post.
Insightful article as always 🙌🏻
I have one point of confusion: do the authors prefer "Re-warming and re-decaying" or "Infinite LR"?
From the paper, I got the impression that the infinite LR schedule is the better choice. However, from the article it seems that the opposite is the case.
An excerpt from the paper:
> We observe that all schedules perform relatively similarly, however, the two infinite schedules have the advantage that we can start annealing at any time during the constant learning rate phase on each split, while the repeated cosine decays require knowing the number of tokens in advance. Additionally, we see negligible forgetting across dataset boundaries for the infinite LR schedules. While the losses initially increase sharply due to re-initializing the optimizer states, the infinite schedules models immediately recover from this.
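For reference, this is roughly the schedule shape I understand the infinite variant to have (all phase lengths and learning-rate values below are placeholders I picked for illustration, not numbers from the paper):

```python
def infinite_lr(step, warmup_steps=1_000, cooldown_steps=4_000,
                anneal_start=None, anneal_steps=2_000,
                peak_lr=3e-4, const_lr=1e-4, min_lr=1e-6):
    # Rough shape of an "infinite" schedule: linear warmup to peak_lr,
    # cool down to const_lr, then cruise at const_lr indefinitely.
    # Annealing to min_lr can be started at any chosen point (anneal_start),
    # which is the stated advantage over repeated cosine decays that
    # require knowing the total number of tokens in advance.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        t = (step - warmup_steps) / cooldown_steps
        return peak_lr + (const_lr - peak_lr) * t
    if anneal_start is None or step < anneal_start:
        return const_lr
    t = min(1.0, (step - anneal_start) / anneal_steps)
    return const_lr + (min_lr - const_lr) * t
```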