Small correction: There was originally a drop from 0.783 to 0.028 for "All-Layer QLoRA" in the causative benchmark, which seemed like a significant drop that went unmentioned in my text.
This was because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-Layer QLoRA" actually improves the benchmark: from 0.783 to 0.788. I have updated the table.
The article was very well written
Loved it.
Are the weights decomposed using PCA?
Good question. The LoRA idea is kind of inspired by PCA, but we don't decompose the weights via PCA.
If you learned the full weight update matrix, ∆W, as in regular finetuning, you could theoretically decompose it into two smaller matrices, A and B, via PCA. But that would require that you already have the full ∆W matrix, and getting this matrix is expensive (it requires that you do the full finetuning).
So, instead of learning the full ∆W matrix, you learn the two matrices A and B directly. Or in other words, instead of carrying out the decomposition of a large matrix into two smaller matrices, you learn an approximation of the two smaller matrices directly.
Let me know if this helps clarify things; I'm happy to try explaining it differently otherwise.
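To make this concrete, here is a minimal, illustrative PyTorch sketch (not the exact code from the article): the pretrained weight stays frozen, and only the two small matrices A and B, which together stand in for ∆W ≈ AB, are trained.

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    # Illustrative sketch: wrap a frozen pretrained linear layer and add a
    # trainable low-rank update, so the full Delta-W matrix is never formed.
    def __init__(self, linear: nn.Linear, rank: int, alpha: float):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():      # keep the pretrained weights frozen
            p.requires_grad_(False)
        in_dim, out_dim = linear.in_features, linear.out_features
        # A starts with small random values, B starts at zero, so the initial
        # update A @ B is zero and training begins at the pretrained model.
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.scaling = alpha / rank             # the alpha/r scaling from the LoRA paper

    def forward(self, x):
        return self.linear(x) + self.scaling * (x @ self.A @ self.B)

# Example: wrap a 4096x4096 projection with a rank-8 LoRA update
layer = LinearWithLoRA(nn.Linear(4096, 4096), rank=8, alpha=16)
```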
What I understand is:
We initialise two weight matrices, one with zeros and the other with random numbers.
Then we run the given task with the pretrained weights and the LoRA matrices fused.
The loss is calculated, and gradients are propagated only through the LoRA matrices.
Kindly correct me if I'm wrong.
A thought: could we decompose the original weight matrices into two matrices using PCA for the LoRA initialisation?
Or would it be too expensive or outright dumb to do? 😃
Sorry, for some reason I missed this comment earlier. I think it might be quite expensive to decompose them, but let's assume it works. The problem then could be that you lose too much information. In LoRA, you keep all the original weights intact and only learn a small additive update on top. Your approach, which is not a bad idea, would probably get rid of too much useful information when you reconstruct the original weight matrix during the forward pass.
Maybe that could be helpful for you: https://www.youtube.com/watch?v=dA-NhCtrrVE
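To illustrate the information-loss point with a toy example (this is just a hypothetical sketch with a random matrix, not code from the article): if you replaced a weight matrix by a rank-r reconstruction rather than keeping it intact as LoRA does, the reconstruction error at typical LoRA ranks would be large.

```python
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024)                  # stand-in for a pretrained weight matrix

# Rank-r reconstruction via truncated SVD (closely related to PCA)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 16                                       # a typical LoRA rank
W_r = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

rel_error = torch.norm(W - W_r) / torch.norm(W)
print(f"Relative reconstruction error at rank {r}: {rel_error:.2f}")
```

Real pretrained weights are not random, so the error would be smaller in practice, but at small ranks a lot of the original matrix would still be thrown away.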
When we do this kind of experiment to tune hyperparameters, are we supposed to repeat the training several times with different seeds and take the average of the weights?
Yes, you could try that, and it could work quite well. There's actually a related idea, called LAWA, where the researchers averaged the models throughout the training trajectory: https://arxiv.org/abs/2306.03241
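If you want to try this, a rough sketch of averaging the weights of several checkpoints (whether from different random seeds or from points along the training trajectory, as in LAWA) could look like the following; the checkpoint file names are just placeholders.

```python
import torch

def average_state_dicts(checkpoint_paths):
    # Load each checkpoint and average the parameters elementwise.
    state_dicts = [torch.load(path, map_location="cpu") for path in checkpoint_paths]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

# Hypothetical usage with checkpoints saved during or after training:
# averaged = average_state_dicts(["seed1.pt", "seed2.pt", "seed3.pt"])
# model.load_state_dict(averaged)
```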
In section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r= 256 and alpha = 128 obvious get the best performance. Why ?
From the original LoRA paper: "We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately."
So the same intuition that applies to tuning the learning rate can probably be transferred here.
I have the same question. In the table you provided, the combination of r=256 and alpha=128 performed the best. Can you elaborate on it a bit more?
These are excellent points, and I found myself asking the same questions. Unfortunately, I don't have a definitive answer.
Empirically, a ratio of 2 for r-to-alpha appears to be optimal in this series of experiments. The exact reason for this is unclear. The short answer is that it essentially acts as a hyperparameter.
This is similar to how 3e-4 (or 1e-3, depending on whom you ask) became a commonly used default learning rate for Adam based on empirical experience. While this is generally a good rule of thumb, it's not universally applicable, and experimenting with different values can sometimes be beneficial.
As for why a ratio of r=256 to alpha=128 is preferable over, for instance, r=512 & alpha=256 or r=128 & alpha=64, I can only speculate. For this specific dataset and model combination, it seems to provide the ideal balance of weight and weight update strengths. With a larger r, there might be more overfitting, and with smaller r values, more underfitting could occur. However, this hypothesis requires further investigation.
As someone previously mentioned above (based on a quote from the LoRA paper), this problem is more or less similar to scaling the learning rate, but it's not entirely the same. The learning rate only affects training, whereas alpha also influences the magnitude of the weights during inference.
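A small numeric illustration of the last two paragraphs (assuming the α/r scaling from the LoRA paper): the three configurations discussed above all apply the same 0.5 scaling factor to the LoRA update, but they differ in how many parameters are trained, which is one reason tuning alpha is not exactly the same as tuning the learning rate.

```python
# Scaling factor applied to the LoRA update in the original paper: alpha / r
for r, alpha in [(128, 64), (256, 128), (512, 256)]:
    scaling = alpha / r                       # identical (0.5) for all three settings
    # For a hypothetical 4096x4096 layer, LoRA trains r * (4096 + 4096) parameters
    n_params = r * (4096 + 4096)
    print(f"r={r:3d}, alpha={alpha:3d} -> scaling={scaling}, trainable params/layer={n_params:,}")
```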
Thank you for this great article!
Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue with the dolly-15k dataset when running on a 24GB GPU, and it's useful to have my thinking confirmed :))
Glad this was helpful!
Great article
Just thought I'd share some of my recent experiments here:
https://github.com/pytholic/llm-peft-recipes/blob/main/lora_insights.ipynb
Awesome, thanks for sharing. Just exported this to my e-reader and will give it a thorough read in the upcoming weeks!
I came here to re-read both of your articles as I was reading "Hands-On LLMs" by Jay Alammar and Maarten Grootendorst. They help a lot, especially from the practical perspective. Thank you!
I think I might try reproducing some of your experiments, probably on the LIMA dataset, since I cannot afford to train on Alpaca at the moment.
You can also always trim the dataset. You can also generate a high-quality dataset with Llama 3, I have a code example for this here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
Thank you 🙏🏻
Also going to read the LIMA paper. I know the gist of it, but I am curious how different their curated instructions were compared to others and whether the approach was adopted by subsequent research.
Generally, though, we still see the trend of fine-tuning on vast amounts of data.
Hi, what do you mean by `static dataset`?
Good point, I could have been more clear on that. I meant datasets that are fixed and are not updated with more examples or new info.
Wonderful
Great article and well written. Thanks a lot!
Thanks for the kind words!
Thanks for the great article! This helped me a lot :)
I have one question. I tried fine-tuning Mistral 7B Instruct v2 with 1,000 samples for 5 epochs and with 5,000 samples for 1 epoch. The samples are definitely from the same dataset, which uses the same instruction template: "[INST] Write appropriate medical impression for following findings: {findings}[/INST]{imprssion}". The 5-epoch version showed much better performance on test samples with the same instruction format.
As you mentioned regarding LIMA in the article, the researchers trained the model on 1k samples for 15 epochs. Seeing this, I can't figure out how the appropriate number of epochs relates to the dataset size. Is it bad to train the model with a small dataset + a high number of epochs? Is a large dataset + 1 epoch good?
I want to know your opinion thanks :)
These are good points, and I unfortunately don't have a good / definitive answer. I think the "number of epochs" is still an open problem. E.g., in the TinyLlama paper (https://arxiv.org/abs/2401.02385) that just came out, they found that multiple epochs are beneficial for pretraining as well (counter to what people previously thought).
So, for now, I'd just try multiple epochs and see what happens on benchmark datasets. What you could do is just keep training and save a model checkpoint after each epoch and then see if it continues to improve on benchmark tasks.
Overall, as a rule of thumb, re "Is it bad to train the model with small dataset size + high epoch?" I'd say the smaller the dataset and the larger the epoch number, the higher the risk of overfitting. Adding more data can reduce overfitting, but so can weight decay (in AdamW), dropout, etc.
The bottom line is, it requires experimentation to find out.
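As a rough sketch of that experimentation loop (with a toy model and dataset standing in for the LLM and the instruction data), one could save a checkpoint after every epoch and lean on weight decay in AdamW as one of the regularizers; all names and values here are illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the LLM and the finetuning dataset
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)
model = nn.Linear(16, 2)

# Weight decay in AdamW is one knob against overfitting (dropout would be another)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

num_epochs = 3
for epoch in range(num_epochs):
    for features, labels in train_loader:
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # Save a checkpoint after each epoch so every one can be evaluated on benchmark tasks
    torch.save(model.state_dict(), f"checkpoint_epoch{epoch + 1}.pt")
```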
Thank you so much for kind reply!! Really appreciate it :)
Thank you for sharing your insights.
I raised the 'r' value to 256 for fine-tuning a 7B Mistral, but it is currently overfitting.
I just realized that these best practices are likely aimed at enhancing academic benchmarks. Perhaps, for most task-specific fine-tuning, it would be advisable to set lower values for 'r' and 'lora_alpha'?
Yes, I think r=256 can be very high, and the mileage may vary depending on the LLM and dataset. Like the learning rate or batch size, I recommend revisiting it for each new project (new LLM and/or new dataset). However, I think that alpha = 2*r is still a good rule of thumb.
How would you compare the performance of QLoRA 4bit vs QLoRA 8bit?
I didn't do any experiments with 8-bit quantization here but directly went to 4-bit quantization to take advantage of the NormalFloat datatype that was introduced in the QLoRA paper. Table 3 in the QLoRA paper has a comparison with int8, but it looks like it's basically the same modeling performance as 4-bit, so maybe not worthwhile exploring: https://arxiv.org/abs/2305.14314
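For anyone who wants to try the 4-bit NormalFloat setting themselves, loading a model this way via Hugging Face transformers and bitsandbytes typically looks roughly like the sketch below (the model name is just an example, and I'm assuming both libraries are installed).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the NormalFloat datatype from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # double quantization, also from the QLoRA paper
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",           # example model; replace with the one you finetune
    quantization_config=bnb_config,
)
```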
Just a thought regarding "Q9: Can LoRA Weights be Combined?".
If I am not mistaken, this can be / is a common approach in RLHF? First we might merge adapters from SFT and then from PPO/DPO (Preference Tuning)?
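To sketch what combining adapters amounts to numerically (a toy illustration, not a statement about any particular RLHF pipeline): because each LoRA adapter is just an additive low-rank update, merging one adapter after another simply adds both updates to the base weights.

```python
import torch

torch.manual_seed(123)
d, r = 64, 8
W = torch.randn(d, d)                        # stand-in for a pretrained weight matrix
scaling = 0.5                                # alpha / r

# Two hypothetical trained adapters, e.g. one from SFT and one from DPO
A1, B1 = torch.randn(d, r), torch.randn(r, d)
A2, B2 = torch.randn(d, r), torch.randn(r, d)

# Merging the SFT adapter first and the preference-tuning adapter afterwards
W_sft = W + scaling * (A1 @ B1)
W_final = W_sft + scaling * (A2 @ B2)
print(W_final.shape)                         # same shape as the original weights
```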