Thank you! This is an incredible intro to understanding fine tuning. I’m curious, though, how these different approaches impact model output/performance. 1. What are the ways that researchers assess output/performance? 2. How different is performance between in-context learning vs. indexing vs. retraining, etc.?
They are all great questions! Regarding the modeling performance, it depends on the task. For classification, prediction accuracy would be straightforward. For summarization, ROUGE or BERTScore, and for translation, BLEU. They are flawed metrics though, and I will cover this (with examples) some other time in the future. For Q&A, the gold standard remains human preference, which is hard to automate.
There are so many intertwined topics ... but I hope I can address them eventually, one at a time, in the upcoming months.
(PS: It's also hard to compare in-context vs retraining for things like ChatGPT because ChatGPT does not give you access to the model itself, so these experiments need to be done with other types of models, e.g., LLaMA/Alpaca/Dolly.)
Hi Sebastian, I am rediscovering this post after running into some issues with fine tuning. Thanks for the awesome post!
I am particularly interested in your opinions on fine tuning all layers vs fine tuning the last layer (maybe plus gradual unfreezing) for repurposing the pretrained model, e.g., for training reward models.
You mentioned in another post that the most popular method nowadays is to fine tune all layers together (i.e., gradual unfreezing as in ULMFiT is out of date). But could you explain why it makes sense? Intuitively, when we add a linear layer to the pretrained backbone to learn a reward, for example, and we use the same very small learning rate (e.g., 1e-5) for both the backbone and the linear layer, the linear layer is basically not changing, so aren't we pretty much adapting the backbone representation to fit random weights in the linear layer?
Thanks ahead for your reply!
Hi there, with regard to updating all layers, I was probably thinking of workflows that involve instruction finetuning (SFT) either via full parameter updates or LoRA. The other example is BERT-style classifiers, which are usually much smaller transformers compared to LLMs used for instruction finetuning etc.
For reward modeling, you are right, attaching and training only the output layer should be sufficient; at least that's what I see most often done in practice.
This is a nice article. But one thing I did not understand is the comment on "keep frozen" for the feature-based approach. In this technique, I understood that we are not doing any fine tuning of the language model. We are just taking the output of the LM and using that to train a new model. So, there is no question of updating the existing model weights. Is my understanding right? Also, for this approach, I think I can only use embedding models like text-ada-embedding. I will not be able to use GPT 2.5 or DaVinci, is my understanding correct?
Yes, that's correct. We are just using the encodings from the model by removing the output layer (which usually returns class labels). We either pass these encodings to a new model (which could even be XGBoost or some other model that is not neural-network related), or we append one or more output layers that are then updated (here, the new layers represent the "new" model).
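For illustration, here is a minimal sketch of that feature-based workflow; the model name, the [CLS] pooling choice, and the tiny toy dataset are just assumptions for the example:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Load a pretrained backbone whose weights stay frozen (no gradient updates).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
backbone.eval()

texts = ["I found this movie interesting", "What a waste of time"]  # toy data
labels = [1, 0]

# Extract encodings; here we use the [CLS] token embedding as the feature vector.
with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    features = backbone(**batch).last_hidden_state[:, 0]

# Train a separate "new" model on the frozen features
# (could just as well be XGBoost or a small neural network head).
clf = LogisticRegression().fit(features.numpy(), labels)
```

Swapping in XGBoost or appending trainable output layers instead of the logistic regression follows the same pattern.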
Hi Sebastian,
Many thanks for writing this, it’s one of the first posts I’ve seen that starts from a first-principles approach. I wanted to ask if you plan to write code tutorials for how to implement the methods mentioned in your posts? Learning to fine tune models with custom data seems to be a very valuable skill, and I was wondering if you could point to some materials, either your own or others', where people could actually practice this.
Your articles coupled with annotated notebooks or examples would be a great combination!
Hi Junaid,
thanks for the comment! This is a good idea, and I was indeed planning to put together more material on this involving code. Related to that, the Adapter finetuning article I posted on Saturday (https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters) has a code implementation at the bottom. I'd say Adapter is a good one to get started since it's probably the simplest one.
Hi Sebastian, thanks for writing this! Big fan of your work. What are the real world use cases you are seeing for PEFT?
Adapting LLMs to specific target tasks or datasets. What's nice about most of these methods is that they leave the original parameters of the model unmodified. E.g., if you are using a huge pretrained LLM at a company, you only need one copy of it and then use PEFT methods to store a small number of weights for each target task or dataset.
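As a rough sketch of what that looks like in code, using the Hugging Face peft library (the base model choice and the LoRA settings here are purely illustrative assumptions):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# One shared copy of the pretrained base model ...
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# ... plus a small set of LoRA weights per target task.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters is trainable

# After finetuning on task A, only the small adapter weights are stored for that task;
# the large base model is reused unmodified across tasks.
model.save_pretrained("lora_adapter_task_a")
```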
Hi Sebastian, I love your content. Always on point as usual. It would be nice if you could go a little deeper into how to compile your training dataset if you want to fine-tune your own LLM. I read a lot about fine-tuning algorithms for LLMs such as prefix/prompt tuning, adapters, or LoRA. But I think structuring and curating your underlying dataset is just as important. Say you have millions/billions of observations, how do you decide which data points to include and how to format your instructions, for example? It would be nice if you went into that sometime. I think a lot of the work you have to do in order to get good performance out of your LLM finetuning will be based on the underlying dataset you use to finetune your LLM.
Absolutely. That's a good point. In the upcoming Ahead of AI issue, I briefly discuss Pythia (where deduplication did not have a significant effect) and RedPajama (a curated open-source dataset for pretraining).
But then, as you said, it would be interesting to talk more about finetuning datasets specifically. I think that's where we also have to distinguish more between the different finetuning tasks, like instruction-finetuning, finetuning for predictive modeling, etc. Currently, the literature is pretty scarce since people just train on what they can get their hands on. But I echo your points and think this is an interesting and important topic.
I hope to see more literature on this once the initial dust settles, when people start analyzing things more carefully after releasing LLMs as fast as possible.
Thanks for your answer. Yeah, I think dataset finetuning for specific tasks is very important. I found this blog post, which I think is a very good example of how much work is involved in structuring a dataset for LLM finetuning: https://www.flowrite.com/blog/dataset-engineering-llm-finetuning
I am thinking and reading a lot about this subject lately.
Haven't seen this article yet. Thanks for sharing!
Thanks for the post. Most of the questions are answered.
Can you please clarify the difference between hard prompt tuning (assuming it's prompt engineering, where we manually create prompts with some modifications, etc.; is that correct?) and soft prompt tuning?
Also, in soft prompt tuning, how can we differentiate between different tasks (QA, summarization, classification, etc.)? Will we be using a different vocabulary for different tasks?
Hi Anand,
you are correct, in "hard" prompt tuning, we manually create prompts, although you can also use an algorithmic approach to creating the prompts. I remember there was even a paper (can't find it off the top of my head right now) where they trained a classifier to score the quality of the prompts. While you are right, I want to add that the emphasis here is on "discrete" prompts. It's called "hard" as opposed to "soft" because it's not differentiable. E.g., you can think of it as a classification problem where you want to modify the input text so that the class score or label changes. E.g., say we have a sentiment classification task with "I found this movie interesting" as the input text, and the the probability score is p(positive|text) = 0.6. We can then try out different words, e.g., changing "interesting -> "great" to change the score from 0.6 to 0.8. The words are not differentiable, so we have to try out different ones to see what works.
In soft prompt tuning, we then work with the embeddings of the words, that is, their numeric vector representations. And those are differentiable, i.e., we can compute the gradient of the loss with respect to the embedding. In soft prompt tuning, we are not modifying the word embeddings themselves though, but we prepend an additional embedded prompt.
E.g., whereas in hard prompt tuning, you can modify a prompt "what is 3 + 4?" to "calculate: what is 3 + 4?", in soft prompt tuning you would do the same with the embeddings.
E.g., if "what is 3 + 4?" is a 6x20 dimensional matrix, (6 tokens, and each token has a 20-dimensional vector representation), then "calculate: what is 3 + 4?" would be 7x20-dimensional. In soft prompt tuning, instead of adding the word "classify" manually to the input prompt, you want to find the optimal embedding, the 1x20 dimensional vector here.
So, regarding your question "how can we differentiate between different tasks (QA, Summarization, classification etc.,)", for each of these tasks, you would learn a different 1x20 embedding vector. The embedding vector is specific to the task you optimize here:
- gradient of Loss(Q&A performance) with respect to 1x20 embedding vector.
- gradient of Loss(Summarization performance) with respect to 1x20 embedding vector.
- etc.
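A minimal PyTorch sketch of that idea, using the toy 20-dimensional embeddings from the example above (the frozen backbone and the training loop are omitted; only the prepended prompt vector would receive gradients):

```python
import torch
import torch.nn as nn

embed_dim = 20        # toy dimensionality from the example above
num_soft_tokens = 1   # the learnable "1x20" prompt vector

# The only trainable parameter; the backbone itself stays frozen.
soft_prompt = nn.Parameter(torch.randn(num_soft_tokens, embed_dim))

def prepend_soft_prompt(token_embeddings):
    # token_embeddings: (batch, seq_len, embed_dim), e.g., the 6x20 "what is 3 + 4?" matrix
    batch_size = token_embeddings.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    return torch.cat([prompt, token_embeddings], dim=1)  # (batch, 1 + seq_len, embed_dim)

# A separate soft prompt (and optimizer) would be learned per task, e.g.:
# optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```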
I hope this helps!
some practical context would be nice, e.g., where does setfit fit etc.
Thanks for the feedback. So the practical context here would be classification (e.g., for the code examples I shared), to keep it simple. I will add a few sentences to clarify that!
I would see SetFit as a flavor of parameter-efficient finetuning -- I will cover more finetuning techniques in the future. One at a time :)
Added a new paragraph that hopefully makes things more clear for the time being.
> To provide some practical context for the discussions below, we are finetuning an encoder-style LLM such as BERT (Devlin et al. 2018) for a classification task. (To keep things simple, this classification task predicts whether a movie review has a positive or negative sentiment.) Note that instead of finetuning an encoder-style LLM, the same approach would work for GPT-like decoder-style LLMs, and I will provide an example of this in a future article. Furthermore, we can also finetune decoder-style LLMs to generate multiple-sentence answers to specific instructions instead of just classifying texts. I will provide hands-on examples of this in future articles as well.
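For readers who want to see roughly what that setup looks like in code, here is a minimal sketch (the model name and the single toy example are illustrative assumptions, not the exact code from the article):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Encoder-style model with a 2-class head for positive/negative sentiment.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["I found this movie interesting"], return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive

# In full finetuning, this backward pass produces gradients for all layers;
# in the feature-based or partial-finetuning variants, most layers would stay frozen.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```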
The diagram presented in "indexing" section does not show the role of LLM in the indexing process. One have to guess that "embedding" black box is the place where the LLM takes part with one of the available vectorizing algorithms it provides.
That's a fair point. I have a different graphic here that is a bit more detailed: https://raw.githubusercontent.com/rasbt/RAGs/main/images/overview.webp
But it also doesn't show the LLM in the embedder portion. That's because the embedder doesn't necessarily have to be an LLM.
Thank you for the diagram and explanation. Also a great overview of the topic.
German translation doesn't need any special prompting. Just say "translate this to German: good morning" and it will work. Your article states otherwise.
Depends on the LLM, I'd say.
Somehow missed this earlier, but I responded to your email this morning!