42 Comments
author

Small correction: The table originally showed a drop from 0.783 to 0.028 for "All-Layer QLoRA" on the causative benchmark, which looked like a significant drop that went unmentioned in my text.

This was because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-Layer QLoRA" improves the benchmark: from 0.783 to 0.788. I have updated the table.

Nov 20, 2023 · Liked by Sebastian Raschka, PhD

The article was very well written.

Loved it.

Are the weights decomposed using PCA?

Nov 20, 2023 · Liked by Sebastian Raschka, PhD

When we run this kind of experiment to tune fine-tuning hyperparameters, are we supposed to repeat the training several times with different seeds and average the resulting weights?

Nov 20, 2023 · Liked by Sebastian Raschka, PhD

In the section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r = 256 and alpha = 128 obviously gets the best performance. Why?
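(For context on how these two hyperparameters interact in the standard LoRA formulation, which the article's setup presumably follows: the learned low-rank update is scaled by alpha/r before being added to the frozen weight, so r sets the capacity of the update while the alpha/r ratio sets how strongly it is applied.)

$$
W' = W + \frac{\alpha}{r} \, B A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}
$$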

Nov 19, 2023 · Liked by Sebastian Raschka, PhD

Thank you for this great article!

Nov 19, 2023 · Liked by Sebastian Raschka, PhD

Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue with the dolly-15k dataset when running on a 24GB GPU, and it's useful to have my thinking confirmed :))

Nov 19, 2023 · Liked by Sebastian Raschka, PhD

Great article

Apr 13 · Liked by Sebastian Raschka, PhD

Hi, what do you mean by `static dataset`?

Mar 5 · Liked by Sebastian Raschka, PhD

Wonderful

Feb 16 · Liked by Sebastian Raschka, PhD

Great article and well written. Thanks a lot!

Feb 14 · Liked by Sebastian Raschka, PhD

Regarding the 8th point about training a 7B model on a single GPU:

I could only fit a batch size of 1 with a context length of 256 on a 16GB GPU.

Is this consistent with your experience as well, or am I missing anything?

Jan 12 · Liked by Sebastian Raschka, PhD

Thanks for the great article! This helped me a lot :)

I have one question. I tried fine-tuning Mistral 7B Instruct v2 with 1,000 samples for 5 epochs and with 5,000 samples for 1 epoch. The samples are definitely from the same dataset, which uses the same instruction format: "[INST] Write appropriate medical impression for following findings: {findings}[/INST]{imprssion}". The 5-epoch version showed much better performance on test samples with the same instruction format.

As you mentioned regarding LIMA in the article, the researchers trained the model on 1k samples for 15 epochs. Given this, I can't figure out how to pick an adequate number of epochs for a given dataset size. Is it bad to train the model on a small dataset for many epochs? Is a large dataset with 1 epoch better?

I'd like to know your opinion. Thanks :)

Dec 12, 2023 · Liked by Sebastian Raschka, PhD

Thank you for sharing your insights.

I raised the 'r' value to 256 for fine-tuning a 7B Mistral, but it is currently overfitting.

I just realized that these best practices are likely aimed at improving academic benchmarks. Perhaps, for most task-specific fine-tuning, it would be advisable to set lower values for 'r' and 'lora_alpha'?
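(For anyone who wants to try lower values, here is a minimal sketch of what a more conservative configuration could look like. It assumes the Hugging Face peft library, which may differ from the setup used in the article, and the specific numbers are illustrative rather than recommendations.)

```python
# Illustrative sketch only (assumes the Hugging Face `peft` library; the values
# below are examples, not recommendations from the article).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # much lower rank than r=256, reducing adapter capacity
    lora_alpha=32,        # scaling factor; the article discusses its ratio to r
    lora_dropout=0.05,    # dropout on the LoRA layers as extra regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Mistral attention projections
    task_type="CAUSAL_LM",
)
```

This config would then be passed to peft's get_peft_model together with the base model; whether a lower rank actually helps will of course depend on the task and the amount of data.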


How would you compare the performance of 4-bit QLoRA vs. 8-bit QLoRA?
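(A minimal sketch of how one could set up such a comparison, assuming the Hugging Face transformers + bitsandbytes stack, which may not match the article's setup; the model name is only a placeholder.)

```python
# Illustrative sketch only: loading a base model in 4-bit vs. 8-bit with
# bitsandbytes via Hugging Face transformers (an assumption; the article may
# use a different framework). "some-base-model" is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (NF4) quantization, as introduced in the QLoRA paper
bnb_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 8-bit quantization for comparison
bnb_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "some-base-model", quantization_config=bnb_4bit
)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "some-base-model", quantization_config=bnb_8bit
)
```

In general, 4-bit loading should use noticeably less memory at the cost of coarser quantization, but the actual quality difference is best measured on your own task.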


Do you have a notebook for this?


Very useful article, but I have a question: what is the maximum number of samples for fine-tuning Llama with LoRA, and how much data is needed to fine-tune Llama with full-parameter tuning? I experimented with 500k samples on 4 V100 16GB GPUs. My training loss looked fine until training step 30,000, where the training loss dropped to 0.0 and the eval loss became NaN. What are your suggestions for solving this problem?
