Small correction: There was originally a drop from 0.783 to 0.028 for "All-Layer QLoRA" in the causative benchmark, which seemed like a significant drop that went unmentioned in my text.
This was because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-Layer QLoRA" actually improves the benchmark: from 0.783 to 0.788. I have updated the table.
The article was very well written
Loved it.
Are the weights decomposed using PCA?
Good question. The LoRA idea is kind of inspired by PCA, but we don't decompose the weights via PCA.
If you learned the full weight update matrix, ∆W, as in regular finetuning, you could theoretically decompose that into two smaller matrices, A and B, via PCA. But that would require that you already have the full ∆W matrix, and getting this matrix is expensive (it requires that you do the full finetuning).
So, instead of learning the full ∆W matrix, you learn the two matrices A and B directly. Or in other words, instead of carrying out the decomposition of a large matrix into two smaller matrices, you learn an approximation of the two smaller matrices directly.
Let me know if this helps clarify it. Otherwise, I'm happy to try to explain it differently.
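To make the "learn A and B directly" point concrete, here is a minimal pure-Python sketch. It is not the code used for the article: the helper names (`lora_param_counts`, `init_lora`, `delta_w`) are made up for illustration, matrices are plain lists of rows, and the initialization follows the LoRA paper's convention (A random, B zero) with an arbitrarily chosen standard deviation.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters for one weight matrix: full update vs. LoRA."""
    full = d_out * d_in            # learning the full update matrix ∆W
    lora = r * d_in + d_out * r    # learning A (r x d_in) and B (d_out x r)
    return full, lora

def init_lora(d_out, d_in, r, seed=0):
    """A starts with small random values, B starts at zero (assumed std of 0.01)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B

def delta_w(A, B, alpha, r):
    """The low-rank update (alpha / r) * B @ A; it is never obtained by decomposing a full ∆W."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

full, lora = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(full, lora)  # 16777216 vs. 65536 trainable parameters
```

Because B starts at zero, the update is exactly zero before training, so the model initially behaves like the pretrained one. Only A and B receive gradients; the pretrained weight matrix stays frozen.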
What I understand is:
We initialise two weight matrices, one with zeros and the other with numbers.
Then we compute the given task with the pretrained weights and the LoRA matrices fused.
The loss is calculated and gradients are propagated only through the LoRA matrices.
Kindly correct me if I’m wrong.
A thought: can we decompose the original weight matrices into two matrices using PCA for the LoRA initialisation?
Or would it be too expensive or outright dumb to do? 😃
Sorry, for some reason I missed this comment earlier. I think it might be quite expensive to decompose them, but let's assume it works. I think the problem then could be that you lose too much information. In LoRA, you keep all the original weights but then just modify a small subset. Your approach, which is not a bad idea, would probably get rid of too much useful information when you reconstruct the original weight matrix during the forward pass.
Maybe that could be helpful for you: https://www.youtube.com/watch?v=dA-NhCtrrVE
When we do these kinds of experiments for fine-tuning hyperparameters, are we supposed to repeat the training several times with different seeds and average the weights?
Yes, you could try that, and it could work quite well. There's actually a related idea, called LAWA, where the researchers averaged the models throughout the training trajectory: https://arxiv.org/abs/2306.03241
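As a rough sketch of what averaging weights means mechanically (this is not the LAWA implementation): assume each run or checkpoint is a dictionary that maps a parameter name to a flat list of floats. The function name `average_weights` and that data layout are assumptions made for illustration.

```python
def average_weights(state_dicts):
    """Element-wise mean of several weight dictionaries with identical keys and shapes."""
    if not state_dicts:
        raise ValueError("need at least one set of weights")
    n = len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        vectors = [sd[key] for sd in state_dicts]
        # zip(*vectors) walks the same position across all runs
        averaged[key] = [sum(vals) / n for vals in zip(*vectors)]
    return averaged

run_a = {"layer.weight": [1.0, 2.0]}
run_b = {"layer.weight": [3.0, 4.0]}
print(average_weights([run_a, run_b]))
```

The same function works for runs with different seeds or for checkpoints saved along a single training trajectory.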
In the section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r=256 and alpha=128 obviously gets the best performance. Why?
From the original LoRA paper: "We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately."
So probably the same thinking from learning rate initialization can be transferred here.
I have the same question. In the table you provided, the combination of r=256 and alpha=128 performed the best. Can you elaborate on that?
These are excellent points, and I found myself asking the same questions. Unfortunately, I don't have a definitive answer.
Empirically, a ratio of 2 for r-to-alpha appears to be optimal in this series of experiments. The exact reason for this is unclear. The short answer is that it essentially acts as a hyperparameter.
This is similar to the rationale behind 0.001 being a commonly used default learning rate for Adam (or 3e-4, depending on whom you ask) based on empirical experience. While this is generally a good rule of thumb, it's not universally applicable, and experimenting with different values can sometimes be beneficial.
As for why a ratio of r=256 to alpha=128 is preferable over, for instance, r=512 & alpha=256 or r=128 & alpha=64, I can only speculate. For this specific dataset and model combination, it seems to provide the ideal balance of weight and weight update strengths. With a larger r, there might be more overfitting, and with smaller r values, more underfitting could occur. However, this hypothesis requires further investigation.
As someone previously mentioned above (based on a quote from the LoRA paper), this problem is more or less similar to scaling the learning rate, but it's not entirely the same. The learning rate only affects training, whereas alpha also influences the magnitude of the weights during inference.
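A small sketch of why alpha also matters at inference: when the adapter is merged, the update is added to the frozen weights after being multiplied by alpha / r. The function names and the tiny matrices below are made up for illustration; `BA` stands for the already-multiplied product of the two trained LoRA matrices.

```python
def lora_scale(alpha, r):
    """Scaling factor applied to the low-rank update, as in the LoRA paper."""
    return alpha / r

def merge(W, BA, alpha, r):
    """Merged weights W + (alpha / r) * BA, with matrices given as lists of rows."""
    s = lora_scale(alpha, r)
    return [[w + s * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1.0, 0.0]]      # frozen pretrained weights (toy values)
BA = [[0.5, -1.0]]    # trained low-rank update (toy values)

print(lora_scale(128, 256))              # 0.5
print(lora_scale(512, 256))              # 2.0
print(merge(W, BA, alpha=128, r=256))    # [[1.25, -0.5]]
print(merge(W, BA, alpha=512, r=256))    # [[2.0, -2.0]]
```

With the same trained A and B, a different alpha gives different merged weights, which is not the case for the learning rate once training is finished.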
thank you for this great article !
Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue for the dolly-15k dataset when running on a 24GB GPU and it's useful to have my thinking confirmed :))
Glad this was helpful!
Great article
Just thought about sharing some of my recent experiments here:
https://github.com/pytholic/llm-peft-recipes/blob/main/lora_insights.ipynb
Awesome, thanks for sharing. Just exported this to my e-reader and will give it a thorough read in the upcoming weeks!
I came here to re-read both of your articles as I was reading "Hands-On LLMs" by Jay Alammar and Maarten Grootendorst. They help a lot, especially from the practical perspective. Thank you!
I think I might try reproducing some of your experiments, probably on the LIMA dataset, since I cannot afford to train on Alpaca at the moment.
You can also always trim the dataset. You can also generate a high-quality dataset with Llama 3. I have a code example for this here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
Thank you 🙏🏻
I'm also going to read the LIMA paper. I know the gist of it, but I am curious how different their curated instructions were compared to others and whether the approach got adopted by follow-up research.
Generally, we still see the trend of fine-tuning on vast amounts of data.
hi. what do you mean by `static dataset`?
Good point, I could have been more clear on that. I meant datasets that are fixed and are not updated with more examples or new info.
Wonderful
Great article and well written. Thanks a lot!
Thanks for the kind words!
Thanks for the great article! This helped me a lot :)
I have one question. I tried fine-tuning Mistral 7B Instruct v2 with 1000 samples for 5 epochs and with 5000 samples for 1 epoch. The samples are definitely from the same dataset, which uses the same instruction throughout: "[INST] Write appropriate medical impression for following findings: {findings}[/INST]{imprssion}". The 5-epoch version actually showed much better performance on test samples with the same instruction format.
As you mentioned LIMA in the article, the researchers trained the model with 1k samples for 15 epochs. Seeing this, I can't work out an adequate number of epochs relative to the dataset size. Is it bad to train the model with a small dataset and a high number of epochs? Is a large dataset with 1 epoch good?
I'd like to know your opinion, thanks :)
These are good points, and I unfortunately don't have a good / definitive answer. I think the "number of epochs" is still an open problem. E.g., in the TinyLlama paper (https://arxiv.org/abs/2401.02385) that just came out, they also found that multiple epochs are beneficial for pretraining as well (counter to what people previously thought).
So, for now, I'd just try multiple epochs and see what happens on benchmark datasets. What you could do is just keep training and save a model checkpoint after each epoch and then see if it continues to improve on benchmark tasks.
Overall, as a rule of thumb, re "Is it bad to train the model with small dataset size + high epoch?" I'd say the smaller the dataset and the larger the epoch number, the higher the risk of overfitting. Adding more data can reduce overfitting, but so can weight decay (in AdamW), dropout, etc.
The bottom line is, it requires experimentation to find out.
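The "save a checkpoint after each epoch and watch the benchmark" idea could be sketched like this. The helper names and the scores are made up for illustration; in practice the scores would come from evaluating each saved checkpoint on your benchmark tasks.

```python
def best_checkpoint(scores):
    """Return (epoch, score) of the best checkpoint; scores[0] belongs to epoch 1."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1, scores[best_index]

def should_stop(scores, patience=2):
    """True if none of the last `patience` epochs beat the best earlier score."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before

# Hypothetical benchmark scores after epochs 1 to 5
scores = [0.61, 0.66, 0.70, 0.69, 0.68]
print(best_checkpoint(scores))   # (3, 0.7)
print(should_stop(scores))       # True: epochs 4 and 5 did not improve on epoch 3
```

A score that stops improving (or drops) while the training loss keeps going down is the typical sign of overfitting mentioned above.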
Thank you so much for the kind reply!! Really appreciate it :)
thank you for sharing your insights.
I raised the 'r' value to 256 for fine-tuning a 7B Mistral, but it is currently overfitting.
I just realized that these best practices are likely aimed at improving academic benchmarks. Perhaps, for most task-specific fine-tuning, it would be advisable to set lower values for 'r' and 'lora_alpha'?
Yes, I think r=256 can be very high and the mileage may vary depending on the LLM and dataset. Like a learning rate or batch size, I recommend revisiting it for each new project (new LLM and/or new dataset). However, I think that alpha = 2*r is still a good rule of thumb.
How would you compare the performance of QLoRA 4bit vs QLoRA 8bit?
I didn't do any experiments with 8-bit quantization here but directly went to 4-bit quantization to take advantage of the NormalFloat datatype that was introduced in the QLoRA paper. Table 3 in the QLoRA paper has a comparison with int8, but it looks like it's basically the same modeling performance as 4-bit, so it may not be worth exploring: https://arxiv.org/abs/2305.14314
Just a thought regarding "Q9: Can LoRA Weights be Combined?".
If I am not mistaken, this can be / is a common approach in RLHF? First we might merge adapters from SFT and then from PPO/DPO (Preference Tuning)?