
Small correction: The table originally showed a drop from 0.783 to 0.028 for "All-layer QLoRA" on the causative benchmark, which looked like a significant drop that went unmentioned in my text.

This was because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-layer QLoRA" improves the benchmark: from 0.783 to 0.788. I have updated the table.

Nov 20, 2023

The article was very well written

Loved it.

Are the weights decomposed using PCA?

Good question. The LoRA idea is kind of inspired by PCA, but we don't decompose the weights via PCA.

If you learned the full weight update matrix, ∆W, as in regular finetuning, you could theoretically decompose that into two smaller matrices, A and B, via PCA. But that would require that you already have the full ∆W matrix, and getting this matrix is expensive (it requires that you do the full finetuning).

So, instead of learning the full ∆W matrix, you learn the two matrices A and B directly. Or in other words, instead of carrying out the decomposition of a large matrix into two smaller matrices, you learn an approximation of the two smaller matrices directly.

Let me know if this helps clarify things; otherwise, I'm happy to try to explain it differently.
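
To make the "learn A and B directly" idea concrete, here is a minimal NumPy sketch. The layer shape, the rank, the alpha value, and the function name `lora_forward` are all made up for illustration; this is not the code used for the experiments in the article. It shows the usual LoRA setup where one factor starts at zero, so the model initially behaves exactly like the pretrained one, and it compares the number of values in a full ∆W with the number in A and B.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 48, 4  # hypothetical layer shape and LoRA rank

# Frozen pretrained weight (random stand-in values).
W = rng.normal(size=(d_out, d_in))

# LoRA factors: A gets small random values, B starts at zero,
# so the initial update B @ A is exactly zero.
A = rng.normal(scale=0.01, size=(r, d_in))
B = np.zeros((d_out, r))


def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank path; W itself is never updated."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


x = rng.normal(size=(2, d_in))
y0 = lora_forward(x, W, A, B, alpha=8, r=r)

full_update_params = d_out * d_in  # number of values in a full ∆W
lora_params = r * (d_in + d_out)   # number of values in A plus B

print("same output as pretrained layer at init:", np.allclose(y0, x @ W.T))
print("full ∆W values:", full_update_params, "| LoRA values:", lora_params)
```

During finetuning, only A and B would receive gradient updates, which is why the full ∆W never has to be materialized or decomposed.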

What I understand is

We initialise two weight matrices,

one with zeros and the other with random numbers.

Then we run the given task with the pretrained weights and the LoRA matrices fused.

The loss is calculated and gradients are propagated only through the LoRA matrices.

Kindly correct me if I’m wrong.

A thought: can we decompose the original weight matrices into two matrices using PCA for the LoRA initialisation?

Or would that be too expensive or outright dumb to do? 😃

Sorry, for some reason I missed this comment earlier. I think it might be quite expensive to decompose them, but let's assume it works. I think the problem then could be that you lose too much information. In LoRA, you keep all the original weights and only add a small low-rank update on top of them. Your approach, which is not a bad idea, would probably get rid of too much useful information when you reconstruct the original weight matrix during the forward pass.
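
As a rough illustration of the information-loss concern, the sketch below replaces a weight matrix by its best rank-r approximation (via truncated SVD, used here as a stand-in for the PCA-style decomposition discussed above) and measures how much of the matrix is lost. The matrix is random, which is an assumption for illustration only: real pretrained weights have different spectra, so the actual error would have to be measured on the model in question.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # random stand-in for a pretrained weight matrix


def truncate_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r, :]


def relative_error(M, M_approx):
    """Frobenius-norm error of the approximation, relative to the norm of M."""
    return np.linalg.norm(M - M_approx) / np.linalg.norm(M)


err_rank8 = relative_error(W, truncate_rank(W, 8))
err_full = relative_error(W, truncate_rank(W, 64))

print(f"relative error at rank 8:  {err_rank8:.3f}")
print(f"relative error at rank 64: {err_full:.2e}")
```

Keeping W intact and adding a low-rank update avoids this reconstruction error entirely, because the low-rank part only has to represent the change, not the whole matrix.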

Jan 26

Maybe that could be helpful for you: https://www.youtube.com/watch?v=dA-NhCtrrVE

Nov 20, 2023

When we do these kinds of experiments for tuning finetuning hyperparameters, are we supposed to repeat the training several times with different seeds and average the weights?

Yes, you could try that, and it could work quite well. There's actually a related idea, called LAWA, where the researchers averaged the models throughout the training trajectory: https://arxiv.org/abs/2306.03241
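
For reference, uniform weight averaging itself is simple. The sketch below assumes each checkpoint is a plain dictionary mapping parameter names to NumPy arrays with identical keys and shapes; the helper name `average_checkpoints` is hypothetical. It is not the LAWA procedure itself (the paper describes which checkpoints along the trajectory to average), only the averaging step.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Uniformly average a list of {name: array} checkpoints.

    Assumes every checkpoint has the same keys and array shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}


# Toy example with two tiny "checkpoints" (made-up values).
ckpt_a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
ckpt_b = {"layer.weight": np.array([3.0, 4.0]), "layer.bias": np.array([1.0])}

averaged = average_checkpoints([ckpt_a, ckpt_b])
print(averaged)
```

Whether the averaged model is actually better than the individual ones would still need to be checked on the benchmark tasks.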

Nov 20, 2023

In the section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r=256 and alpha=128 obviously gets the best performance. Why?

Nov 20, 2023

From the original LoRA paper: "We then scale ∆Wx by α/r, where α is a constant in r. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if we scale the initialization appropriately."

So probably the same thinking from learning rate initialization can be transferred here.

Nov 20, 2023

I have the same question. In the table you provided, the combination of r=256 and alpha=128 performed the best. Can you elaborate on this?

These are excellent points, and I found myself asking the same questions. Unfortunately, I don't have a definitive answer.

Empirically, a ratio of 2 for r-to-alpha appears to be optimal in this series of experiments. The exact reason for this is unclear. The short answer is that it essentially acts as a hyperparameter.

This is similar to the rationale behind 0.001 being a commonly used default learning rate for Adam (or 3e-4, depending on whom you ask) based on empirical experience. While this is generally a good rule of thumb, it's not universally applicable, and experimenting with different values can sometimes be beneficial.

As for why a ratio of r=256 to alpha=128 is preferable over, for instance, r=512 & alpha=256 or r=128 & alpha=64, I can only speculate. For this specific dataset and model combination, it seems to provide the ideal balance of weight and weight update strengths. With a larger r, there might be more overfitting, and with smaller r values, more underfitting could occur. However, this hypothesis requires further investigation.

As someone previously mentioned above (based on a quote from the LoRA paper), this problem is more or less similar to scaling the learning rate, but it's not entirely the same. The learning rate only affects training, whereas alpha also influences the magnitude of the weights during inference.
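
To illustrate the last point, here is a small sketch of how the α/r factor from the LoRA paper quote ends up in the merged weights. The helper names `lora_scale` and `merge_lora` and the tiny matrices are made up for illustration. Because the scaled update is added to W when the LoRA weights are merged, the chosen alpha still shapes the weights used at inference time, unlike a learning rate.

```python
import numpy as np


def lora_scale(alpha, r):
    """Scaling factor alpha / r applied to the low-rank update."""
    return alpha / r


def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = A.shape[0]
    return W + lora_scale(alpha, r) * (B @ A)


for r, alpha in [(256, 128), (256, 256), (256, 512)]:
    print(f"r={r}, alpha={alpha} -> scale={lora_scale(alpha, r)}")

# Tiny made-up example: doubling alpha doubles the contribution of B @ A.
W = np.zeros((2, 2))
A = np.ones((1, 2))
B = np.ones((2, 1))
merged_alpha_1 = merge_lora(W, A, B, alpha=1)
merged_alpha_2 = merge_lora(W, A, B, alpha=2)
```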

Nov 19, 2023

Thank you for this great article!

Nov 19, 2023

Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue for the dolly-15k dataset when running on a 24GB GPU and it's useful to have my thinking confirmed :))

Glad this was helpful!

Nov 19, 2023

Great article

Apr 13

Hi. What do you mean by `static dataset`?

Good point, I could have been clearer on that. I meant datasets that are fixed and are not updated with more examples or new info.

Mar 5


Feb 16

Great article and well written. Thanks a lot!

Thanks for the kind words!

Feb 14

Regarding the 8th point about training a 7B model on a single GPU,

I got a batch size of 1 with context length 256 on a 16GB GPU.

Is this consistent with your experience as well or am I missing anything?

Feb 14

This could work. If the memory consumption is still too high, you could try double-quantization in QLoRA. I.e., setting `--quantize bnb.fp4-dq` instead of `--quantize bnb.fp4`

I found the issue 😅

You're also using a batch size of 1, but with a gradient accumulation of 128, which is why the effective batch size is 128.

I used Hugging Face, and the difference in notation confused me a bit.

Also, I was using double quantization already.

Thanks for your help
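
For anyone else tripping over the notation, the relationship can be sketched as follows. The function names are hypothetical; in Hugging Face's `TrainingArguments`, the corresponding settings are called `per_device_train_batch_size` and `gradient_accumulation_steps`. Memory usage is driven by the micro-batch size (1 in the setup discussed above), while the optimizer sees gradients accumulated over 128 micro-batches.

```python
def effective_batch_size(micro_batch_size, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return micro_batch_size * grad_accum_steps * num_devices


def optimizer_steps(num_micro_batches, grad_accum_steps):
    """Count optimizer updates when stepping once every grad_accum_steps micro-batches."""
    steps = 0
    for i in range(1, num_micro_batches + 1):
        # The backward pass would run here for every micro-batch,
        # adding to the accumulated gradients.
        if i % grad_accum_steps == 0:
            steps += 1  # optimizer step, then reset the accumulated gradients
    return steps


print(effective_batch_size(micro_batch_size=1, grad_accum_steps=128))
print(optimizer_steps(num_micro_batches=256, grad_accum_steps=128))
```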

Jan 12

Thanks for the great article! This helped me a lot :)

I have one question. I tried fine-tuning Mistral 7B Instruct v2 with 1,000 samples for 5 epochs and with 5,000 samples for 1 epoch. The samples are definitely from the same dataset, which uses the same instruction throughout: "[INST] Write appropriate medical impression for following findings: {findings}[/INST]{impression}". The 5-epoch version actually showed much better performance on test samples with the same instruction format.

As you mentioned regarding LIMA in the article, the researchers trained the model with 1k samples for 15 epochs. Given this, I can't figure out how to choose an adequate number of epochs relative to the dataset size. Is it bad to train the model with a small dataset and many epochs? Is a large dataset with 1 epoch good?

I want to know your opinion thanks :)

These are good points, and I unfortunately don't have a good / definitive answer. I think the "number of epochs" is still an open problem. E.g., in the TinyLlama paper (https://arxiv.org/abs/2401.02385) that just came out, they found that multiple epochs are beneficial for pretraining as well (counter to what people previously thought).

So, for now, I'd just try multiple epochs and see what happens on benchmark datasets. What you could do is just keep training and save a model checkpoint after each epoch and then see if it continues to improve on benchmark tasks.

Overall, as a rule of thumb, re "Is it bad to train the model with small dataset size + high epoch?" I'd say the smaller the dataset and the larger the epoch number, the higher the risk of overfitting. Adding more data can reduce overfitting, but so can weight decay (in AdamW), dropout, etc.

The bottom line is, it requires experimentation to find out.
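
The "save a checkpoint after each epoch and compare" suggestion could look roughly like the sketch below. The callables `train_one_epoch`, `save_checkpoint`, and `evaluate` are hypothetical placeholders for whatever your training framework provides, and the scores in the toy run are made up purely to show the control flow.

```python
def train_with_checkpoints(num_epochs, train_one_epoch, save_checkpoint, evaluate):
    """Train epoch by epoch, saving a checkpoint and recording a benchmark score each time."""
    scores = []
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(epoch)
        save_checkpoint(epoch)
        scores.append(evaluate(epoch))
    return scores


def pick_best_epoch(scores_per_epoch):
    """Return (epoch, score) for the highest score; epochs are 1-indexed."""
    best_idx = max(range(len(scores_per_epoch)), key=lambda i: scores_per_epoch[i])
    return best_idx + 1, scores_per_epoch[best_idx]


# Toy run with made-up benchmark scores that peak at epoch 3 and then degrade.
fake_scores = {1: 0.61, 2: 0.66, 3: 0.70, 4: 0.68, 5: 0.64}
saved = []
scores = train_with_checkpoints(
    num_epochs=5,
    train_one_epoch=lambda epoch: None,  # placeholder for the real training step
    save_checkpoint=saved.append,        # placeholder: just records the epoch number
    evaluate=lambda epoch: fake_scores[epoch],
)
best_epoch, best_score = pick_best_epoch(scores)
print(best_epoch, best_score)
```

If the benchmark score starts dropping while the training loss keeps falling, that is the overfitting signal described above.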

Thank you so much for the kind reply!! Really appreciate it :)

Dec 12, 2023

Thank you for sharing your insights.

I raised the 'r' value to 256 for fine-tuning a 7B Mistral model, but it is currently overfitting.

I just realized that these best practices are likely aimed at improving academic benchmark scores. Perhaps, for most task-specific fine-tuning, it would be advisable to set lower values for 'r' and 'lora_alpha'?

Yes, I think r=256 can be very high and the mileage may vary depending on the LLM and dataset. Like a learning rate or batch size, I recommend revisiting it for each new project (new LLM and/or new dataset). However, I think that alpha = 2*r is still a good rule of thumb.

How would you compare the performance of QLoRA 4bit vs QLoRA 8bit?

I didn't do any experiments with 8-bit quantization here but directly went to 4-bit quantization to take advantage of the NormalFloat datatype that was introduced in the QLoRA paper. Table 3 in the QLoRA paper has a comparison with int8, but it looks like it's basically the same modeling performance as 4-bit, so maybe not worthwhile exploring: https://arxiv.org/abs/2305.14314

Do you have a notebook for this?

Sorry, I don't have a notebook for this. But the good news is that we greatly simplified the LitGPT interface, so it's pretty easy to use and change the LoRA settings now. I've written a short 0 to LitGPT tutorial here: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/0_to_litgpt.md#finetune-llms

Very useful article, but I have a question: what is the maximum number of samples for fine-tuning Llama with LoRA, and how much data is needed to fine-tune Llama with full-parameter tuning? I experimented with 500k samples on four 16 GB V100 GPUs. My training loss looked fine until training step 30,000, at which point the training loss became 0.0 and the eval loss became NaN. What are your suggestions for solving the problem?

I would lean towards saying that your model overfits a lot, making the validation loss explode. But a NaN in the loss usually indicates that there's some sort of bug or numerical stability issue. I would try to swap training and validation sets for a quick run and see if you get the same. Or, if you now get a training loss of NaN and a validation loss of a certain high number, that would indicate that there's something not right in the formatting pipeline.
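
A quick way to narrow this down is to scan the logged losses for the first step where something goes wrong. The sketch below assumes the losses are available as a plain list of Python floats; the helper names and the example values are made up. Knowing the first bad step makes it easier to inspect the corresponding batch for formatting problems.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


def zero_loss_steps(losses, tol=1e-12):
    """Return the steps where the loss is (numerically) zero, which is rarely genuine."""
    return [step for step, loss in enumerate(losses) if abs(loss) <= tol]


# Made-up loss logs for illustration.
train_losses = [1.20, 0.85, 0.40, 0.0, 0.0]
eval_losses = [1.30, 0.95, float("nan")]

print("first non-finite eval loss at index:", first_non_finite(eval_losses))
print("zero training loss at indices:", zero_loss_steps(train_losses))
```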

Please reply Sebastian Raschka

