Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters.
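To make the definition concrete, here is a minimal NumPy sketch of the low-rank update W' = W + (alpha/r)·BA (toy sizes; this is an illustrative analogue, not the post's PyTorch code):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2   # toy layer size and LoRA rank
alpha = 4                     # LoRA scaling factor

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # trainable, zero-initialized

# Only A and B (2 * 8 * 2 = 32 values) are trained, instead of all 64 entries of W.
W_adapted = W + (alpha / rank) * (B @ A)

# Because B starts at zero, the adapted layer initially matches the pretrained one.
assert np.allclose(W_adapted, W)
```

The rank controls the trade-off: a smaller rank means fewer trainable parameters but a less expressive update.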
Fantastic write-up, thank you!
Small correction:
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
layer_lora_1 = LinearWithLoRA(layer, rank=2, alpha=4)
print("LoRA output:", layer_lora_2(x))
Minor typo: did you mean print("LoRA output:", layer_lora_1(x))?
Thanks for great write-up
Great write-up! Clear, informative, on a super useful topic. Thank you for sharing!!
I got an OOM error on this line:
denominator = numerator.norm(p=2, dim=0, keepdim=True)
Will it consume much more GPU memory? How can we handle this?
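One possible workaround (a NumPy sketch of the idea; the function name and chunk size are illustrative, not from the post): compute the column norms of the adapted weight chunk by chunk, so the full adapted matrix never has to be materialized at once.

```python
import numpy as np

def adapted_column_norms(W, A, B, scale, chunk=1024):
    # Column-wise L2 norm of W + scale * (B @ A), computed in column
    # chunks so the full adapted matrix is never materialized at once.
    parts = []
    for i in range(0, W.shape[1], chunk):
        block = W[:, i:i + chunk] + scale * (B @ A[:, i:i + chunk])
        parts.append(np.linalg.norm(block, axis=0, keepdims=True))
    return np.concatenate(parts, axis=1)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 4096))
A = rng.standard_normal((4, 4096))
B = rng.standard_normal((32, 4))

# Matches the all-at-once computation exactly:
full = np.linalg.norm(W + 0.5 * (B @ A), axis=0, keepdims=True)
assert np.allclose(adapted_column_norms(W, A, B, 0.5, chunk=500), full)
```

The peak extra memory is then one (d_out, chunk) block rather than the whole (d_out, d_in) matrix; the same chunking pattern can be written in PyTorch with tensor slicing.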
Thanks for your great writing.
Great write-up. Thank you!
Great write-up! thank you!
as always, great post!
Thank you very much!
Incredible write-up and turn around on this.
Doesn't it seem a bit slow that m is multiplied by both V and delta V? (Especially multiplying by V adds steps to the forward propagation.) I don't really see a way around this, though.
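For what it's worth, the extra multiplications only matter during training: like LoRA, the DoRA result can be merged back into a single weight matrix after fine-tuning, so inference pays no extra cost. A NumPy sketch (toy sizes; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 6, 2
alpha = 4

W0 = rng.standard_normal((d_out, d_in))        # pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01   # LoRA factors for the direction
B = np.zeros((d_out, rank))
m = np.linalg.norm(W0, axis=0, keepdims=True)  # magnitude vector, initialized from W0

# Training-time forward: magnitude times the column-normalized direction
V = W0 + (alpha / rank) * (B @ A)
W_dora = m * (V / np.linalg.norm(V, axis=0, keepdims=True))

# After training, W_dora is just an ordinary (d_out, d_in) matrix that can
# replace the pretrained weight, so the per-step normalization disappears.
# With B still zero-initialized, it reduces exactly to W0:
assert np.allclose(W_dora, W0)
```

So the normalization overhead is a training-time cost only, and during training the per-column norm is cheap relative to the matrix products themselves.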
For domains where we have large amounts of raw data (e.g., 10 billion tokens or more), would PEFT methods like DoRA/LoRA, combined with converting the data to instruction format (e.g., AdaptLLM), be sufficient to adapt the model to the new domain, or do we definitely have to perform full fine-tuning?
Indeed a good article. @rasbt I am surprised by the result too. I would also be curious what happens if we just apply weight normalization to standard LoRA itself.
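One way to read that question (a speculative NumPy sketch, not from the post): normalize each column of the LoRA-updated weight and rescale it by the pretrained column norms, i.e. DoRA with the magnitude vector frozen instead of trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 6, 6, 2, 4

W0 = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((rank, d_in)) * 0.01
B = rng.standard_normal((d_out, rank)) * 0.01

V = W0 + (alpha / rank) * (B @ A)               # standard LoRA update
m0 = np.linalg.norm(W0, axis=0, keepdims=True)  # fixed pretrained magnitudes

# "Weight-normalized LoRA": direction from the LoRA update, magnitude frozen at m0
W_wn = m0 * (V / np.linalg.norm(V, axis=0, keepdims=True))

# Every column of W_wn keeps the pretrained column norm exactly:
assert np.allclose(np.linalg.norm(W_wn, axis=0, keepdims=True), m0)
```

The difference from DoRA is that here the LoRA factors can only change directions, never magnitudes, so this would isolate how much of DoRA's gain comes from the learnable magnitude vector.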
Awesome 🤩 I was wondering: can LoRA/DoRA be applied to vision models or object detection models as well?
Pretrained models know how to detect/classify, but they don't know what to detect (out-of-distribution concepts).
Great explanation, thank you!
Seems like LeCun's server is down; I currently can't download the MNIST datasets from there.