Small correction: the table originally showed a drop from 0.783 to 0.028 for "All-layer QLoRA" on the causative benchmark, which would have been a significant drop that went unmentioned in my text.
This was because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-layer QLoRA" actually improves the benchmark: from 0.783 to 0.788. I have updated the table.
The article was very well written
Loved it.
Are the weights decomposed using PCA?
When we do this kind of experiment for fine-tuning hyperparameters, are we supposed to repeat the training several times with different seeds and average the weights?
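To make the question concrete, here is a hypothetical sketch of what averaging adapter weights across seed runs could look like. Nothing here is from the article: `train_lora` is a stand-in for your own training routine, assumed to return the adapter's state dict.

```python
import torch

def average_over_seeds(train_lora, seeds=(0, 1, 2)):
    # `train_lora(seed)` is a hypothetical stand-in that runs one
    # fine-tuning job with the given seed and returns a state_dict.
    state_dicts = [train_lora(seed) for seed in seeds]
    # Element-wise mean of each parameter tensor across the runs.
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```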
In the section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r = 256 and alpha = 128 obviously gets the best performance. Why?
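For readers who want to try this setting themselves, a minimal sketch of how r and alpha are typically passed to a LoRA config, assuming the Hugging Face peft library (the target modules are illustrative, not from the article):

```python
from peft import LoraConfig

config = LoraConfig(
    r=256,           # rank of the low-rank update matrices
    lora_alpha=128,  # scaling factor; the effective scale is lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
)
```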
Thank you for this great article!
Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue with the dolly-15k dataset on a 24 GB GPU, and it's useful to have my thinking confirmed :))
Great article
Hi, what do you mean by `static dataset`?
Wonderful
Great article and well written. Thanks a lot!
Regarding the 8th point about training a 7B model on a single GPU:
I got a batch size of 1 with a context length of 256 on a 16 GB GPU.
Is this consistent with your experience as well, or am I missing anything?
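For context, a hedged sketch of the memory-saving settings that usually make a micro-batch of 1 workable, assuming the Hugging Face transformers Trainer (values are illustrative; actual savings depend on model and hardware):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # micro-batch of 1, as reported above
    gradient_accumulation_steps=16,  # simulate a larger effective batch size
    gradient_checkpointing=True,     # trade recompute for activation memory
    fp16=True,                       # mixed precision (use bf16=True on Ampere+)
)
```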
Thanks for the great article! This helped me a lot :)
I have one question. I tried fine-tuning Mistral 7B Instruct v2 with 1000 samples for 5 epochs and with 5000 samples for 1 epoch. The samples are definitely from the same dataset, which uses the same instruction template: "[INST] Write appropriate medical impression for following findings: {findings}[/INST]{impression}". The 5-epoch version actually performed much better on test samples in the same instruction format.
As you mentioned LIMA in the article, those researchers trained the model on 1k samples for 15 epochs. Seeing this, I can't work out an adequate epoch count relative to the dataset size. Is it bad to train the model on a small dataset with a high epoch count? Is a large dataset with 1 epoch better?
I want to know your opinion, thanks :)
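One small arithmetic point worth making explicit: both schedules in the question above process the same total number of training examples, so the comparison isolates repetition versus data diversity rather than total compute.

```python
# Total examples seen by each schedule in the question above:
small_set = 1_000 * 5  # 1000 samples x 5 epochs
large_set = 5_000 * 1  # 5000 samples x 1 epoch
assert small_set == large_set == 5_000
```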
Thank you for sharing your insights.
I raised the 'r' value to 256 for fine-tuning a 7B Mistral, but it is currently overfitting.
I just realized that these best practices are likely aimed at maximizing academic benchmarks. Perhaps, for most task-specific fine-tuning, it would be advisable to set lower values for 'r' and 'lora_alpha'?
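For illustration, a hedged sketch of a more conservative configuration along those lines, again assuming the Hugging Face peft library (the specific values are illustrative, not recommendations from the article):

```python
from peft import LoraConfig

conservative = LoraConfig(
    r=16,              # much lower rank -> far fewer trainable parameters
    lora_alpha=32,     # common heuristic: alpha around 2 * r
    lora_dropout=0.1,  # dropout on the LoRA layers as extra regularization
)
```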
How would you compare the performance of QLoRA 4bit vs QLoRA 8bit?
Do you have a notebook for this?
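As a reference point for the 4-bit vs. 8-bit question, a minimal sketch of how the two quantization modes are selected, assuming the transformers + bitsandbytes integration:

```python
import torch
from transformers import BitsAndBytesConfig

four_bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)
eight_bit = BitsAndBytesConfig(load_in_8bit=True)

# Pass one of these as `quantization_config=` to
# AutoModelForCausalLM.from_pretrained(...) before attaching LoRA adapters.
```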
Very useful article, but I have a question: what is the maximum number of samples for fine-tuning Llama with LoRA, and how much data is needed to fine-tune Llama with full-parameter tuning? I experimented with 500k samples on 4 V100 16 GB GPUs. My training loss looked fine until training step 30000, where the training loss dropped to 0.0 and the eval loss became NaN. What are your suggestions to solve this problem?
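A hedged illustration of the knobs usually checked first when training loss collapses to 0.0 and eval goes NaN, assuming the Hugging Face Trainer (the values are illustrative starting points, not a diagnosis of this specific run):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,  # try a lower learning rate; divergence can appear late
    max_grad_norm=1.0,   # gradient clipping to guard against blow-ups
    warmup_ratio=0.03,   # a gentle warmup often helps numerical stability
    logging_steps=50,    # log frequently enough to catch the collapse early
)
```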