What are the different ways to use and finetune pretrained large language models (LLMs)? The most common options include a feature-based approach, in-context prompting, and updating a subset of the model parameters.
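To make the feature-based approach a bit more concrete, below is a minimal sketch (my own illustration; it assumes the Hugging Face transformers library and scikit-learn, and the model name and toy texts are placeholders). The idea is that the pretrained model stays frozen and only a small classifier is trained on its output embeddings.

```python
# Minimal sketch of the feature-based approach: keep the pretrained model frozen,
# use its hidden states as features, and train a small classifier on top.
# Assumes Hugging Face transformers and scikit-learn; model name and texts are placeholders.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()  # the backbone stays frozen; no gradient updates

texts = ["Free prize, click now!", "Meeting moved to 3 pm."]  # toy examples
labels = [1, 0]  # 1 = spam, 0 = not spam

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden_dim)
    features = hidden.mean(dim=1).numpy()       # average-pool tokens into one vector per text

clf = LogisticRegression().fit(features, labels)  # only this small classifier is trained
print(clf.predict(features))
```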
Testing my knowledge here haha.
> When does it make more sense to use in-context learning rather than fine-tuning, and vice versa?
I think fine-tuning makes sense when we need domain adaptation and have enough data and resources for it. However, if we 1) want to evaluate/test the model's few-shot performance, 2) do not have sufficient data/resources, or 3) do not have strong domain adaptation requirements, then in-context learning can suffice.
Definitely makes sense to always test in-context performance before jumping into fine-tuning.
---
> In prefix tuning, adapters, and LoRA, how can we ensure that the model preserves (and does not forget) the original knowledge?
I can think of two approaches (off the top of my head).
1) I think we can use something like a KL divergence penalty (similar to how we handle the reward hacking problem in PPO-RLHF); a rough sketch of this idea is shown after this list.
2) We can maybe use a small amount (1-5%) of the original dataset along with the fine-tuning dataset.
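(A rough sketch of the KL-penalty idea in point 1, assuming a Hugging Face-style causal LM that returns `.loss` and `.logits`; `model` is the adapter-augmented model being finetuned and `ref_model` a frozen copy of the original.)

```python
import torch
import torch.nn.functional as F

def loss_with_kl_penalty(model, ref_model, input_ids, labels, beta=0.1):
    # Task loss (cross entropy) from the model being finetuned (e.g., with LoRA adapters).
    outputs = model(input_ids, labels=labels)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits   # frozen reference model, no gradients

    log_p = F.log_softmax(outputs.logits, dim=-1)  # finetuned model's token distribution
    log_q = F.log_softmax(ref_logits, dim=-1)      # original (reference) distribution
    # KL(p || q): penalize the finetuned distribution for drifting from the reference.
    kl = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    return outputs.loss + beta * kl                # beta controls the strength of the penalty
```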
That sounds about right! To the first point, I'd add that indexing/retrieval-augmented generation could be another alternative that falls between in-context learning and finetuning. It's especially useful if the information is too large to fit into the context.
Regarding your second point, I like the KL divergence idea from PPO, but including a small percentage of the original data is probably the safer bet. In https://arxiv.org/abs/2403.08763 they found it was quite effective.
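(To make the ratio concrete, here's a toy sketch of what that mixing could look like; the two lists below are just placeholders for real datasets.)

```python
import random

# Toy sketch: blend a small fraction (1-5%) of the original data into the
# finetuning set to reduce forgetting. The lists are placeholders for real datasets.
original_data = [f"original example {i}" for i in range(1000)]
finetune_data = [f"finetune example {i}" for i in range(200)]

random.seed(0)
replay_fraction = 0.05                                        # the 1-5% mentioned above
n_replay = max(1, int(replay_fraction * len(original_data)))  # how many originals to keep

mixed_data = finetune_data + random.sample(original_data, k=n_replay)
random.shuffle(mixed_data)                                    # train on the shuffled mixture
print(len(mixed_data), "examples,", n_replay, "of them replayed from the original data")
```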
Regarding RAG, my knowledge is pretty limited so I am guessing the simpler form would be to load the new data, chunk it to fit in the context window, and store it in a vector DB. Seems like a pretty efficient way to get performance bumps in some cases.
But then it seems like we have to pay a lot of attention to the database?
Actually, for most practical use cases you don't even need a database but can store it in memory. I have an example here: https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/03_retrieval-augmented-generation/retrieval-augmented-generation.ipynb
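For anyone who wants the gist without opening the notebook, here's a minimal in-memory sketch of my own (not the notebook's code; it assumes the sentence-transformers package, and the chunks and query are toy placeholders):

```python
# Minimal in-memory retrieval-augmented generation sketch (no vector DB):
# embed the chunks once, keep the embeddings in a plain tensor, and retrieve
# the most relevant chunk by cosine similarity at query time.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "The return policy allows refunds within 30 days of purchase.",
    "Shipping to Europe usually takes 5-7 business days.",
    "Premium support is available Monday through Friday.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)  # the "index", held in memory

query = "How long do I have to return an item?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, chunk_embeddings)[0]  # similarity to each chunk
best_chunk = chunks[int(scores.argmax())]

# The retrieved chunk is then prepended to the LLM prompt:
prompt = f"Answer using the context below.\n\nContext: {best_chunk}\n\nQuestion: {query}"
print(prompt)
```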
Thank you for sharing this 🙌🏻
These notebooks look really helpful for starters. I hope I can make some time to play with them soon!
Could you expand, with any suggested reading, on this quote:
“Generally, in-context learning does not perform as well as finetuning for certain tasks or specific datasets since it relies on the pretrained model’s ability to generalize from its training data without further adapting its parameters for the particular task at hand.”
Separately, ReFT is probably an improvement on prefix based approaches. It involves adding trainable low rank representations to hidden layer features.
Sure, I'm happy to elaborate more. Or maybe it's even easier to explain with a concrete example. Suppose you want to implement a model that predicts whether an email is spam or not. If you take a pretrained LLM and finetune it on a large, labeled spam dataset (as a binary classifier to predict spam/not-spam), that finetuned LLM will likely be more accurate than using the pretrained LLM via in-context learning. (I am currently implementing this finetuning in chapter 6 of my Build a Large Language Model from Scratch book.)
Let me know if you have any follow-up concerns or questions. Also, thanks for sharing the ReFT paper reference ("ReFT: Reasoning with Reinforced Fine-Tuning" https://arxiv.org/abs/2401.08967 if others are curious). This might be an interesting candidate for a from-scratch implementation some time!
Thanks, that's a helpful example.
Digging further, I'm curious why? Why is it that fine-tuning might be better than putting that data in the prompt?
Perhaps the key difference is simply that fine-tuning will minimise the cross-entropy loss on the examples, whereas adding the examples to the prompt cannot? As such, the "prompt only" model can only be directed towards a certain part of its statistical distribution. Its statistical distribution itself cannot be changed by the prompt.
I'd say that's because a) available models have usually not been explicitly trained on the target task, and b) seeing a large(r) number of examples during finetuning helps focus the model.
And yes, like you said, this involves changing the weights.
Btw., here are some experiments finetuning a GPT-2 model for classification if you are interested: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments
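If it helps to see the general shape of that setup, here's a simplified sketch (my own toy version, not the code from the linked experiments; it uses Hugging Face's GPT2ForSequenceClassification with placeholder data):

```python
# Simplified sketch of finetuning GPT-2 as a binary (spam / not-spam) classifier.
# Toy data and only a few training steps; not the code from the linked experiments.
import torch
from transformers import AutoTokenizer, GPT2ForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id   # needed for padded batches

texts = ["You won a free cruise, click here!", "Can we move our call to Friday?"]
labels = torch.tensor([1, 0])                        # 1 = spam, 0 = not spam

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):                                   # a few toy gradient steps
    outputs = model(**batch, labels=labels)          # cross-entropy loss on the class labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {outputs.loss.item():.4f}")
```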
Another great article! I loved the wide range of topics covered and the clear way of describing them! Keep it up!!
Wanted to point out that the link to the RAG examples is broken.
Ref:
> Interested readers can find code examples illustrating LLM Indexing and retrieval augmented generation here.
Might want to update it to:
https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/03_retrieval-augmented-generation
Thanks for the note, just updated the link!
Looks like a typo here
> ΔW = W_A W_B, where W_A ∈ ℝ^(A×h) and W_A ∈ ℝ^(h×B).
Should be W_B ∈ ℝ^(h×B) instead, I guess.
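For reference, a quick shape check in PyTorch (my own illustration with placeholder dimensions; h here plays the role of the small inner/rank dimension):

```python
import torch

# Shape check for the LoRA factorization from the quote above, with the corrected W_B:
# delta_W = W_A @ W_B, where W_A is (A, h) and W_B is (h, B).
A, B, h = 768, 768, 8            # placeholder dimensions; h is the small inner (rank) dimension

W_A = torch.randn(A, h) * 0.01   # trainable low-rank factor
W_B = torch.zeros(h, B)          # often initialized to zero so delta_W starts as all zeros

delta_W = W_A @ W_B              # shape (A, B), same as the frozen weight matrix it updates
print(delta_W.shape)             # torch.Size([768, 768])
```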
Also, could you talk more about the reward model in RLHF? What is that model composed of, and how is it used to finetune a base model?