Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters.
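For concreteness, the core idea can be sketched in a few lines of PyTorch (a minimal sketch, not necessarily the exact code used later in the article: the low-rank matrices A and B are the only trainable parameters, and alpha scales their contribution):

```
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        # A and B are the only trainable tensors; their product is a
        # rank-constrained update added on top of the frozen pretrained weight
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # scaled low-rank update: alpha * x @ (A @ B)
        return self.alpha * (x @ self.A @ self.B)
```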
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
In the LoRA implementation, you are applying the layers to the input and then summing up the results. Wouldn't it be more efficient to sum up the weights of both layers first then multiply the input by result?
It should be valid since in matrix multiplication: x(A + B) is the same as xA + xB where x is a constant and (A and B) are matrices.
Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
Does it seem a bit slow that m is multiplied by both V and delta V? (I guess especially multiplying by V adds steps for the forward propagation). I don't really see a way around this though.
Thanks! The weight norm definitely adds an extra step but that should not be a bottleneck in practice imho. The multiplication should only happen once, too, per update.
For domains where we have large amounts of raw data (eg: 10 Billion tokens or more), would peft methods like DORA/LORA combined with converting the data to instruction format (eg: AdaptLLM) be sufficient to adapt the model to the new domain or do we have to definitely perform Full Fine-Tuning?
Good question. You can try to use DoRA/LoRA for that but I would do that in a pretraining setting. Note that in instruction-finetuning, you only have 1 model update per input/target text. The target is the input text shifted by 1 like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.
I have never worked directly on large scale pre-training, but how does that work on large texts? Do they perform 1 step (predict next token) for each token in the document as a starting point? So, each document would create number of steps = number of tokens in the document (minus the context size)? Or do people use some form of sampling?
I've only seen nanogpt's training loop : https://github.com/karpathy/nanoGPT/blob/master/train.py#L120 where each batch basically randomly samples the starting points from one large document, but I'm guessing this is not how large industry models are trained.
I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.
I think it depends on how things are implemented as large datasets may be distributed across different machines. In my book, I am scanning over the document using a PyTorch DataLoader like you described (note that I use shuffle=True option though to randomize access to prevent overfitting).
NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it aims to minimize use of other tools. From a quick look, it looks like it's doing sampling with replacement so you may get some samples multiple times and others not at all. Both approaches would work though in practice as it's more about seeing large amounts of data.
Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py
Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.
I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.
But the repository was pretty useful to learn some other aspects of the pretraining process.
--
For fun, I played with asking GPT4 about how pre-training works, and according to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.
Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.
this means that we need set a trainable tensor which shape is same to the basicWeights. This may loss the advantage of LoRA's variant which only finetuning the less parameters.
By looking for the DoRA source code. We found they manage this step by using the function below:
Thanks for the comment. I think the code you copied is from my DoraMerged variant, which is where the adapters are merged with the Linear weights, which is why it looks like that. There are different ways you can implement LoRA and DoRA, i.e., the merged and the unmerged variant.
Feb 18·edited Feb 18Liked by Sebastian Raschka, PhD
Thanks, that worked! Just looked briefly into the code while reading the article, looked good so far!
Btw I have found a minor issue: In the article, I think the provided link after "DoRA is more robust to the rank hyperparameter than LoRA" visualization is not correct, should be the DoRA paper (https://arxiv.org/abs/2402.09353). I have looked this up bc I wanted to compare the avg. accuracy values in the visualization with the ones in the table on p. 19 in the DoRA paper :)
Thanks! Yeah, it's also about the specific package version for reproducibility of the results, and package functionalities might be renamed or aggregated in the future.
Fantastic write-up, thank you!
Small correction:
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
Good catch, thanks!
layer_lora_1 = LinearWithLoRA(layer, rank=2, alpha=4)
print("LoRA output:", layer_lora_2(x))
Minor typo, did you mean `print("LoRA output:", layer_lora_1(x))`?
Thanks for the great write-up.
Good catch, updated it!
Yes, it seems to be outdated in the article (in the code in the repo, it is correct)
Great write-up! Clear, informative, on a super useful topic. Thank you for sharing!!
Thanks for the kind words!
Awesome article!
Small question:
In the LoRA implementation, you are applying the layers to the input and then summing up the results. Wouldn't it be more efficient to sum up the weights of both layers first and then multiply the input by the result?
It should be valid, since in matrix multiplication x(A + B) is the same as xA + xB, where x is the input and A and B are matrices.
Hey there,
You raise a good question; there are multiple ways to implement it. In the article, I had:
```
class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```
and
```
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        lora = self.lora.A @ self.lora.B  # combine LoRA matrices
        # then combine LoRA with orig. weights
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```
Is the second case what you had in mind? I.e., `self.linear.weight + self.lora.alpha * lora.T`?
My bad! Not sure how I missed the second variation. So in this case, should we favor the second variation for fewer matrix multiplications?
I have an OOM error on this line: `denominator = numerator.norm(p=2, dim=0, keepdim=True)`. Will it consume much more GPU memory? How can we handle this?
Hm not sure how to compute this more cheaply. What model were you running this on?
I use the T0-3B model. LoRA works well on a 40 GB or 80 GB machine, but DoRA has OOM issues on 40 GB/80 GB with rank=16. I have to reduce the LoRA rank.
Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
Thanks for your great writing.
Great write-up. Thank you!
Great write-up! thank you!
Glad to hear! Thanks!
as always, great post!
Thank you very much!
Glad to hear!
Incredible write-up and turn around on this.
Does it seem a bit slow that m is multiplied by both V and ΔV? (I guess multiplying by V especially adds steps to the forward pass.) I don't really see a way around this, though.
Thanks! The weight norm definitely adds an extra step, but that should not be a bottleneck in practice, imho. The multiplication also only happens once per update.
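For readers following this thread, here is a rough sketch of the merged DoRA variant, pieced together from the snippets quoted elsewhere in these comments (the class and attribute names are assumptions based on the article's code, so treat it as illustrative rather than the exact implementation):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearWithDoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        # trainable magnitude vector m, initialized from the column norms of W
        self.m = nn.Parameter(self.linear.weight.norm(p=2, dim=0, keepdim=True))

    def forward(self, x):
        lora = self.lora.A @ self.lora.B
        numerator = self.linear.weight + self.lora.alpha * lora.T  # W + ΔW
        denominator = numerator.norm(p=2, dim=0, keepdim=True)
        directional_component = numerator / denominator  # normalized direction
        new_weight = self.m * directional_component      # rescaled by m once per forward pass
        return F.linear(x, new_weight, self.linear.bias)
```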
For domains where we have large amounts of raw data (e.g., 10 billion tokens or more), would PEFT methods like DoRA/LoRA, combined with converting the data to instruction format (e.g., AdaptLLM), be sufficient to adapt the model to the new domain, or do we definitely have to perform full fine-tuning?
Good question. You can try to use DoRA/LoRA for that, but I would do it in a pretraining setting. Note that in instruction-finetuning, you only have one model update per input/target text. The target is the input text shifted by 1, like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.
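To illustrate what "the target is the input text shifted by 1" means, here is a toy sketch with made-up token IDs:

```
import torch

token_ids = torch.tensor([21, 64, 3, 88, 9, 42])  # made-up token IDs
inputs = token_ids[:-1]   # model sees:     [21, 64, 3, 88, 9]
targets = token_ids[1:]   # model predicts: [64, 3, 88, 9, 42]
```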
Interesting.
I have never worked directly on large-scale pre-training, but how does that work on long texts? Do they perform one step (predict the next token) for each token in the document as a starting point? So each document would create a number of steps equal to the number of tokens in the document (minus the context size)? Or do people use some form of sampling?
I've only seen nanoGPT's training loop (https://github.com/karpathy/nanoGPT/blob/master/train.py#L120), where each batch basically samples starting points at random from one large document, but I'm guessing this is not how large industry models are trained.
I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.
I think it depends on how things are implemented, as large datasets may be distributed across different machines. In my book, I am scanning over the document using a PyTorch DataLoader like you described (note that I use the shuffle=True option, though, to randomize access and prevent overfitting).
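A simplified sketch of that kind of sliding-window setup (the class and parameter names here are made up for illustration, not the book's actual code):

```
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    # Illustrative only: chunk one long token sequence into fixed-length
    # input/target pairs for next-token prediction
    def __init__(self, token_ids, context_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(1000))  # stand-in for a tokenized document
dataset = SlidingWindowDataset(token_ids, context_length=32, stride=32)
# shuffle=True randomizes the order in which the chunks are visited each epoch
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```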
NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it, the project aims to minimize the use of other tools. From a quick look, it seems to be sampling with replacement, so you may get some samples multiple times and others not at all. Both approaches work in practice, though, as it's more about seeing large amounts of data.
Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py
Thanks for taking the time to respond!
Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.
I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.
But the repository was pretty useful to learn some other aspects of the pretraining process.
--
For fun, I tried asking GPT-4 how pre-training works, and according to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.
Indeed, a good article. @rasbt, I am surprised by the result too. I would also be curious what happens if we just apply weight normalization to standard LoRA itself.
That's a good question; I feel like this would have been a useful ablation study in the paper
Awesome 🤩 I was wondering, can LoRA/DoRA be applied to vision models or object detection models as well?
Pretrained models know how to detect/classify, but they don't know what to detect (out-of-distribution concepts).
Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.
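As a rough sketch of how that could look in practice, a hypothetical helper can recursively swap the nn.Linear layers of a frozen pretrained vision model for the LinearWithLoRA wrapper discussed above (the helper name and settings are made up for illustration):

```
import torch.nn as nn

def replace_linear_with_lora(model, rank, alpha):
    # Hypothetical helper: walk the module tree and wrap every nn.Linear
    # (assumes the LinearWithLoRA class shown earlier in this thread)
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Linear):
            setattr(model, name, LinearWithLoRA(module, rank=rank, alpha=alpha))
        else:
            replace_linear_with_lora(module, rank, alpha)

# Usage sketch: freeze the pretrained weights first, then add the adapters
# for param in vision_model.parameters():
#     param.requires_grad = False
# replace_linear_with_lora(vision_model, rank=8, alpha=16)
```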
Great explanation, thank you!
Thanks for your great work, learned a lot!
But I got quite confused about:
self.m = nn.Parameter(
    self.linear.weight.norm(p=2, dim=0, keepdim=True))
This means that we need to set a trainable tensor with the same shape as the base weights, which may lose the advantage of LoRA variants that fine-tune only a small number of parameters.
Looking at the DoRA source code, we found that they handle this step using the function below:
def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
    # calculate L2 norm of weight matrix, column-wise
    weight = transpose(weight, self.fan_in_fan_out)
    weight = weight + scaling * lora_weight
    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
    return weight_norm

x_eye = torch.eye(lora_A.weight.shape[1], device=lora_A.weight.device, dtype=x.dtype)
lora_weight = lora_B(lora_A(x_eye)).T
They seem to reuse the LoRA parameters by passing in the base weight, the LoRA weight, and the scaling.
Do you think this is better? It seems it doesn't follow the paper, where the magnitude is also a trainable parameter.
Thanks for the comment. I think the code you copied is from my DoraMerged variant, which is where the adapters are merged with the Linear weights, which is why it looks like that. There are different ways you can implement LoRA and DoRA, i.e., the merged and the unmerged variant.
PS: The code above was found inside the PEFT library.
Seems like LeCun's server is down; currently I cannot download the MNIST dataset from there.
Arg, that's weird. In the meantime, I uploaded the local copy of MNIST here. You just need to download and place the "data" folder as is next to the notebook: https://drive.google.com/drive/folders/1QbZwHwyHqCMN7RpS5WWRNQSXK0UptKi_?usp=share_link
Thanks, that worked! Just looked briefly into the code while reading the article, looked good so far!
Btw, I have found a minor issue: in the article, I think the link provided after the "DoRA is more robust to the rank hyperparameter than LoRA" visualization is not correct; it should be the DoRA paper (https://arxiv.org/abs/2402.09353). I looked this up because I wanted to compare the avg. accuracy values in the visualization with the ones in the table on p. 19 of the DoRA paper :)
Thanks for the note, must have copied the wrong arxiv paper URL. Should be fixed now!
Can you please also add a requirements.txt to the repo? :)
Sure, just added. It's basically just torch and torchvision for the dataset.
Thanks! Yeah, it's also about the specific package version for reproducibility of the results, and package functionalities might be renamed or aggregated in the future.