Things I Learned From Hundreds of Experiments
Small correction: The table originally showed a drop from 0.783 to 0.028 for "All-Layer QLoRA" in the causative benchmark, which looked like a significant drop that went unmentioned in my text.
This happened because I was looking at the correct numbers in my notes but had an incorrect number in the table figure I prepared for the post. In reality, "All-Layer QLoRA" improves the benchmark from 0.783 to 0.788. I have updated the table.
The article was very well written.
Are the weights decomposed using PCA?
When we do this kind of experiment for fine-tuning hyperparameters, are we supposed to repeat the training several times with different seeds and take the average weights?
In the section "Balancing LoRA Hyperparameters: R and Alpha", the setting of r=256 and alpha=128 obviously gets the best performance. Why?
Thank you for this great article!
Thanks for the tip about the memory requirements for longer sequence lengths! I have been trying to debug a CUDA memory issue with the dolly-15k dataset when running on a 24 GB GPU, and it's useful to have my thinking confirmed :))
How would you compare the performance of 4-bit QLoRA vs. 8-bit QLoRA?