Low-rank adaptation (LoRA) is a machine learning technique that adapts a pretrained model (for example, an LLM or vision transformer) to a specific, often smaller, dataset by training only a small number of additional parameters, in the form of low-rank matrices, while the pretrained weights stay frozen.
Fantastic write-up, thank you!
Small correction:
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
Good catch, thanks!
layer_lora_1 = LinearWithLoRA(layer, rank=2, alpha=4)
print("LoRA output:", layer_lora_2(x))
Minor typo: did you mean print("LoRA output:", layer_lora_1(x))?
Thanks for the great write-up!
Good catch, updated it!
Yes, it seems to be outdated in the article (the code in the repo is correct).
Great write-up! Clear, informative, on a super useful topic. Thank you for sharing!!
Thanks for the kind words!
Thanks for this detailed post. Awesome explanation!
Awesome article!
Small question:
In the LoRA implementation, you are applying the layers to the input and then summing up the results. Wouldn't it be more efficient to sum up the weights of both layers first then multiply the input by result?
It should be valid since matrix multiplication is distributive: x(A + B) is the same as xA + xB, where x is the input and A and B are matrices.
Hey there,
You raise a good question; there are multiple ways to implement it. In the article, I had:
```
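import torch.nn as nn


class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # pretrained layer
        self.lora = LoRALayer(  # LoRALayer as defined in the article
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Apply both layers to the input, then sum the outputs
        return self.linear(x) + self.lora(x)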
```
and
```
import torch.nn as nn
import torch.nn.functional as F


class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # pretrained layer
        self.lora = LoRALayer(  # LoRALayer as defined in the article
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        lora = self.lora.A @ self.lora.B  # Combine LoRA matrices
        # Then combine LoRA with orig. weights
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```
Is the second case what you had in mind? I.e., `self.linear.weight + self.lora.alpha*lora.T`?
My bad! Not sure how I missed the second variation. So in this case, should we favor the second variation for fewer matrix multiplications?
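A quick way to check that the two variants above agree is to run both on the same input. The sketch below is a minimal check, not the article's code: the `LoRALayer` definition is an assumption (it only needs the `A`, `B`, and `alpha` attributes that the snippets above use), and `B` is filled with random values so the LoRA term is nonzero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(123)


class LoRALayer(nn.Module):
    # Assumed definition: A is (in_dim, rank), B is (rank, out_dim).
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        # Random instead of zeros so the comparison is not trivial.
        self.B = nn.Parameter(torch.randn(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


linear = nn.Linear(10, 4)
lora = LoRALayer(linear.in_features, linear.out_features, rank=2, alpha=4)
x = torch.randn(3, 10)

# Variant 1: apply both layers to the input, then sum the outputs.
out_separate = linear(x) + lora(x)

# Variant 2: sum the weights first, then apply them to the input once.
combined_weight = linear.weight + lora.alpha * (lora.A @ lora.B).T
out_merged = F.linear(x, combined_weight, linear.bias)

outputs_match = torch.allclose(out_separate, out_merged, atol=1e-5)
print("Outputs match:", outputs_match)
```

Which variant is cheaper depends on the shapes. The separate path adds roughly batch × rank × (in + out) multiplications per forward pass, while forming `A @ B` costs in × rank × out regardless of batch size, so merging tends to pay off mainly when the weights are merged once and then reused (for example, after training).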
I get an OOM error on this line:
`denominator = numerator.norm(p=2, dim=0, keepdim=True)`. Does it consume much more GPU memory? How can we handle this?
Hm, not sure how to compute this more cheaply. What model were you running this on?
I use the T0-3B model. LoRA works well on a 40G or 80G machine, but DoRA runs into OOM issues on both 40G and 80G with rank=16. I have to reduce the LoRA rank.
Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
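One thing that may be worth trying in the meantime is to detach the norm from the autograd graph, so that it is treated as a constant during the backward pass. This is only a sketch under assumptions: the function name `dora_weight` and the `detach_norm` flag are made up for illustration, nothing in this thread confirms it as the authors' implementation, and whether it is enough to avoid the OOM on a 3B model is untested. It also changes the gradients slightly, because the norm's contribution to them is dropped.

```python
import torch

torch.manual_seed(0)


def dora_weight(weight, A, B, alpha, m, detach_norm=True):
    # weight: (out, in), A: (in, rank), B: (rank, out), m: (1, in)
    numerator = weight + alpha * (A @ B).T
    denominator = numerator.norm(p=2, dim=0, keepdim=True)
    if detach_norm:
        # Treat the norm as a constant: no gradient flows through it.
        denominator = denominator.detach()
    return m * (numerator / denominator)


out_dim, in_dim, rank = 4, 10, 2
weight = torch.randn(out_dim, in_dim)  # stands in for a frozen pretrained weight
A = torch.randn(in_dim, rank, requires_grad=True)
B = torch.randn(rank, out_dim, requires_grad=True)
m = weight.norm(p=2, dim=0, keepdim=True).clone().requires_grad_()

w_detached = dora_weight(weight, A, B, alpha=4, m=m, detach_norm=True)
w_full = dora_weight(weight, A, B, alpha=4, m=m, detach_norm=False)

# The forward values are identical; only the backward pass differs.
same_forward = torch.allclose(w_detached, w_full)

w_detached.sum().backward()
has_grads = all(p.grad is not None for p in (A, B, m))
print("Same forward values:", same_forward, "| gradients available:", has_grads)
```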
Thanks for your great writing.
Great write-up. Thank you!
Great write-up! Thank you!
Glad to hear! Thanks!
As always, great post!
Thank you very much!
Glad to hear!
Incredible write-up and turnaround on this.
Does it seem a bit slow that m is multiplied by both V and delta V? (I guess especially multiplying by V adds steps for the forward propagation). I don't really see a way around this though.
Thanks! The weight norm definitely adds an extra step but that should not be a bottleneck in practice imho. The multiplication should only happen once, too, per update.
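To make the "only once" point concrete, the merged DoRA weight can be written in this thread's notation as follows (the column-wise norm notation $\lVert \cdot \rVert_c$ is assumed here from the DoRA paper):

```latex
W' = m \cdot \frac{V + \Delta V}{\lVert V + \Delta V \rVert_c}
```

Here $m$ scales the sum $V + \Delta V$ after it has been normalized, so it amounts to one elementwise multiplication on the combined matrix rather than two separate ones.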
For domains where we have large amounts of raw data (e.g., 10 billion tokens or more), would PEFT methods like DoRA/LoRA, combined with converting the data to instruction format (e.g., AdaptLLM), be sufficient to adapt the model to the new domain, or do we definitely have to perform full fine-tuning?
Good question. You can try to use DoRA/LoRA for that but I would do that in a pretraining setting. Note that in instruction-finetuning, you only have 1 model update per input/target text. The target is the input text shifted by 1 like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.
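The "target is the input text shifted by 1" idea can be sketched in a few lines. The token IDs below are made up for illustration and do not come from a real tokenizer.

```python
# Hypothetical token IDs for one training text.
token_ids = [15, 7, 42, 8, 23, 4]

# Inputs are all tokens except the last; targets are the same
# sequence shifted one position to the left.
inputs = token_ids[:-1]
targets = token_ids[1:]

# At each position, the model sees the context so far and
# is trained to predict the next token.
for position in range(len(inputs)):
    context = inputs[: position + 1]
    print(context, "->", targets[position])
```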
Interesting.
I have never worked directly on large-scale pre-training, but how does that work on large texts? Do they perform 1 step (predict next token) for each token in the document as a starting point? So, each document would create number of steps = number of tokens in the document (minus the context size)? Or do people use some form of sampling?
I've only seen nanoGPT's training loop (https://github.com/karpathy/nanoGPT/blob/master/train.py#L120), where each batch basically randomly samples the starting points from one large document, but I'm guessing this is not how large industry models are trained.
I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.
I think it depends on how things are implemented, as large datasets may be distributed across different machines. In my book, I am scanning over the document using a PyTorch DataLoader like you described (note that I use the shuffle=True option, though, to randomize access and help prevent overfitting).
NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it aims to minimize use of other tools. From a quick look, it looks like it's doing sampling with replacement so you may get some samples multiple times and others not at all. Both approaches would work though in practice as it's more about seeing large amounts of data.
Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py
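The two data-access strategies compared above (a pass over sliding windows with shuffling versus random start positions sampled with replacement) can be sketched in plain Python. This is a toy illustration under assumptions, not nanoGPT's or the book's actual code: the function names, the tiny token list, and the stride choice are all made up.

```python
import random

random.seed(0)

# Stand-in for a tokenized corpus (real corpora have billions of tokens).
tokens = list(range(20))
context_size = 4


def sliding_windows(data, context_size, stride):
    # Deterministic pass: every window start is visited once per epoch.
    windows = []
    for start in range(0, len(data) - context_size, stride):
        x = data[start : start + context_size]
        y = data[start + 1 : start + context_size + 1]
        windows.append((x, y))
    return windows


def random_batch(data, context_size, batch_size):
    # Sampling with replacement: a start may be drawn twice or never.
    starts = [random.randrange(len(data) - context_size) for _ in range(batch_size)]
    return [
        (data[s : s + context_size], data[s + 1 : s + context_size + 1])
        for s in starts
    ]


epoch = sliding_windows(tokens, context_size, stride=context_size)
random.shuffle(epoch)  # plays the role of shuffle=True in a DataLoader
batch = random_batch(tokens, context_size, batch_size=4)

print("Windows per epoch:", len(epoch))
print("One random batch:", batch)
```

With a stride smaller than the context size, the sliding windows overlap; with a stride equal to the context size, each token appears as an input at most once per epoch.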
Thanks for taking the time to respond!
Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.
I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.
But the repository was pretty useful to learn some other aspects of the pretraining process.
--
For fun, I played with asking GPT-4 about how pre-training works, and according to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.
Indeed a good article. @rasbt I am surprised by the result too. I would also be curious what happens if we just apply weight normalization to standard LoRA itself?
That's a good question; I feel like this would have been a useful ablation study in the paper.
Awesome 🤩 I was wondering whether LoRA/DoRA can be applied to vision models or object detection models as well.
Pretrained models know how to detect/classify, but they don't know what to detect (out-of-distribution concepts).
Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.
Great explanation, thank you!
I would like to ask how I should evaluate the prediction accuracy of large models.
Do you mean specifically an LLM trained as a classifier? Then you would essentially use a test set like in regular ML or DL and calculate it as "num correct predictions" / "all predictions". Here's a hands-on example from Chapter 6 of my book: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb
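As a minimal sketch of that calculation (the labels below are made up, and the helper name `accuracy` is just for illustration; in practice the predictions would come from running the fine-tuned classifier on a held-out test set):

```python
def accuracy(predicted_labels, true_labels):
    if len(predicted_labels) != len(true_labels):
        raise ValueError("Both lists must have the same length.")
    if not true_labels:
        raise ValueError("Cannot compute accuracy on an empty test set.")
    # "num correct predictions" / "all predictions"
    num_correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return num_correct / len(true_labels)


# Hypothetical test-set labels (1 = spam, 0 = not spam).
true_labels = [1, 0, 0, 1, 1]
predicted_labels = [1, 0, 1, 1, 0]

test_accuracy = accuracy(predicted_labels, true_labels)
print(f"Test accuracy: {test_accuracy:.2f}")
```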