43 Comments

Fantastic write-up, thank you!

Small correction:

"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."

applying DoRA to the directional matrix --> applying LoRA to the directional matrix.


Good catch, thanks!


layer_lora_1 = LinearWithLoRA(layer, rank=2, alpha=4)

print("LoRA output:", layer_lora_2(x))

Minor typo: did you mean `print("LoRA output:", layer_lora_1(x))`?

Thanks for the great write-up!


Good catch, updated it!


Yes, it seems to be outdated in the article (the code in the repo is correct).


Great write-up! Clear, informative, on a super useful topic. Thank you for sharing!!


Thanks for the kind words!


Awesome article!

Small question:

In the LoRA implementation, you apply both layers to the input and then sum the results. Wouldn't it be more efficient to sum the weights of both layers first and then multiply the input by the result?

This should be valid because matrix multiplication is distributive: x(A + B) is the same as xA + xB, where x is the input and A and B are matrices.


Hey there,

You raise a good question; there are multiple ways to implement it. In the article, I had:

```
class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```

and

```
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        lora = self.lora.A @ self.lora.B  # Combine LoRA matrices
        # Then combine LoRA with orig. weights
        combined_weight = self.linear.weight + self.lora.alpha*lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```

Is the second case what you had in mind? I.e., `self.linear.weight + self.lora.alpha*lora.T`?
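As a quick sanity check, the two variants can be compared numerically. The sketch below assumes a `LoRALayer` that stores `A` with shape `(in_features, rank)`, `B` with shape `(rank, out_features)`, and a scalar `alpha`, which is what the merged code above implies. The initialization and the layer sizes are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class LoRALayer(nn.Module):
    # Assumed layout: A is (in_dim, rank), B is (rank, out_dim)
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        # B is random here (not zeros) so the comparison is non-trivial
        self.B = nn.Parameter(torch.randn(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


linear = nn.Linear(10, 4)
lora = LoRALayer(linear.in_features, linear.out_features, rank=2, alpha=4)
x = torch.randn(3, 10)

# Unmerged: two separate paths, summed at the output
out_unmerged = linear(x) + lora(x)

# Merged: fold the low-rank update into the weight, then do one matmul
delta_w = lora.alpha * (lora.A @ lora.B).T  # shape (out_features, in_features)
out_merged = F.linear(x, linear.weight + delta_w, linear.bias)

same = torch.allclose(out_unmerged, out_merged, atol=1e-4)
print("Outputs match:", same)
```

As a rough guide, which form is cheaper depends on the shapes: the unmerged form adds two thin matrix multiplications per forward pass, while the merged form rebuilds a full-size weight on every forward pass during training. After training, the update can be merged into the weight once, so inference costs the same as the original layer.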


My bad! Not sure how I missed the second variation. So in this case, should we favor the second variation because it needs fewer matrix multiplications?


I get an OOM error on this line:

`denominator = numerator.norm(p=2, dim=0, keepdim=True)`

Does it consume much more GPU memory? How can we handle this?


Hm, I'm not sure how to compute this more cheaply. What model were you running this on?


I use the T0-3B model. LoRA works well on a 40 GB or 80 GB machine, but DoRA runs out of memory on both with rank=16, so I have to reduce the LoRA rank.


Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
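One option that may help, as far as I understand the DoRA paper's discussion of training overhead, is to treat the norm in the denominator as a constant during backpropagation, i.e., detach it from the gradient graph so that no gradient flows through it. A minimal sketch follows; the helper name `dora_weight` and the tensor sizes are hypothetical, and whether this resolves the OOM for T0-3B is untested.

```python
import torch

torch.manual_seed(0)


def dora_weight(weight, lora_A, lora_B, alpha, m, detach_norm=True):
    """Hypothetical helper: build m * (W + dW) / ||W + dW|| column-wise."""
    numerator = weight + alpha * (lora_A @ lora_B).T
    denominator = numerator.norm(p=2, dim=0, keepdim=True)
    if detach_norm:
        # Treat the norm as a constant: no gradient flows through it
        denominator = denominator.detach()
    return m * (numerator / denominator), denominator


out_features, in_features, rank = 4, 10, 2
weight = torch.randn(out_features, in_features)  # frozen base weight
lora_A = torch.randn(in_features, rank, requires_grad=True)
lora_B = torch.randn(rank, out_features, requires_grad=True)
m = weight.norm(p=2, dim=0, keepdim=True).clone().requires_grad_(True)

new_weight, denom = dora_weight(weight, lora_A, lora_B, alpha=4, m=m)
new_weight.sum().backward()

grads_ok = all(t.grad is not None for t in (lora_A, lora_B, m))
print("norm tracked by autograd:", denom.requires_grad)
print("grads reach A, B, m:", grads_ok)
```

The trainable tensors still receive gradients through the numerator and through `m`; only the path through the norm is cut.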


Thanks for your great writing.


Great write-up. Thank you!


Great write-up! Thank you!


Glad to hear! Thanks!


As always, great post!

Thank you very much!


Glad to hear!


Incredible write-up and turnaround on this.

Does it seem a bit slow that m is multiplied by both V and delta V? (I guess especially multiplying by V adds steps for the forward propagation). I don't really see a way around this though.


Thanks! The weight norm definitely adds an extra step but that should not be a bottleneck in practice imho. The multiplication should only happen once, too, per update.
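To make the "only once" point concrete: in a merged-style implementation, the pretrained weight and the low-rank update are added first, the sum is normalized column-wise, and `m` scales the result a single time. A minimal sketch with hypothetical tensor sizes (the variable names are illustrative, not taken from the paper's code):

```python
import torch

torch.manual_seed(0)

out_features, in_features, rank, alpha = 4, 10, 2, 4

V = torch.randn(out_features, in_features)  # pretrained weight
A = torch.randn(in_features, rank)          # LoRA factors (illustrative values)
B = torch.randn(rank, out_features)
delta_V = alpha * (A @ B).T                 # low-rank update, same shape as V

m = V.norm(p=2, dim=0, keepdim=True)        # one magnitude per column

combined = V + delta_V                      # add first ...
direction = combined / combined.norm(p=2, dim=0, keepdim=True)
W_new = m * direction                       # ... then scale by m once

# Each column of W_new is a unit vector scaled to the magnitude stored in m
col_norms = W_new.norm(p=2, dim=0, keepdim=True)
norms_match = torch.allclose(col_norms, m, atol=1e-5)
print("column norms equal m:", norms_match)
```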


For domains where we have large amounts of raw data (e.g., 10 billion tokens or more), would PEFT methods like DoRA/LoRA, combined with converting the data to instruction format (e.g., AdaptLLM), be sufficient to adapt the model to the new domain, or do we definitely have to perform full fine-tuning?


Good question. You can try to use DoRA/LoRA for that but I would do that in a pretraining setting. Note that in instruction-finetuning, you only have 1 model update per input/target text. The target is the input text shifted by 1 like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.


Interesting.

I have never worked directly on large scale pre-training, but how does that work on large texts? Do they perform 1 step (predict next token) for each token in the document as a starting point? So, each document would create number of steps = number of tokens in the document (minus the context size)? Or do people use some form of sampling?

I've only seen nanoGPT's training loop (https://github.com/karpathy/nanoGPT/blob/master/train.py#L120), where each batch basically randomly samples the starting points from one large document, but I'm guessing this is not how large industry models are trained.

I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.


I think it depends on how things are implemented, as large datasets may be distributed across different machines. In my book, I scan over the document using a PyTorch DataLoader like you described (note that I use the shuffle=True option to randomize access and prevent overfitting).

NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it aims to minimize use of other tools. From a quick look, it looks like it's doing sampling with replacement so you may get some samples multiple times and others not at all. Both approaches would work though in practice as it's more about seeing large amounts of data.
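The two batching strategies discussed here can be sketched side by side. The class name `SlidingWindowDataset`, the function `random_batch`, and the toy token tensor are hypothetical; the second function only imitates the random-offset idea and is not nanoGPT's actual code.

```python
import torch
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(0)


class SlidingWindowDataset(Dataset):
    """Scan a token sequence in fixed windows; targets are inputs shifted by 1."""

    def __init__(self, tokens, context_length, stride):
        self.tokens = tokens
        self.context_length = context_length
        # The last valid start leaves room for the shifted target
        self.starts = list(range(0, len(tokens) - context_length, stride))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        s = self.starts[idx]
        x = self.tokens[s : s + self.context_length]
        y = self.tokens[s + 1 : s + 1 + self.context_length]
        return x, y


def random_batch(tokens, context_length, batch_size):
    """Draw random start offsets (with replacement) from one long sequence."""
    starts = torch.randint(len(tokens) - context_length, (batch_size,)).tolist()
    x = torch.stack([tokens[s : s + context_length] for s in starts])
    y = torch.stack([tokens[s + 1 : s + 1 + context_length] for s in starts])
    return x, y


tokens = torch.arange(33)  # toy stand-in for a tokenized document
dataset = SlidingWindowDataset(tokens, context_length=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# One epoch over the DataLoader visits every window exactly once, in random order
seen = sorted(int(row[0]) for batch_x, _ in loader for row in batch_x)
print(seen)

xb, yb = random_batch(tokens, context_length=4, batch_size=2)
print(xb.shape, yb.shape)
```

With the DataLoader, shuffling changes the order but each window still appears once per epoch; with random offsets, some positions can repeat and others can be skipped.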


Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py


Thanks for taking the time to respond!

Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.

I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.

But the repository was pretty useful to learn some other aspects of the pretraining process.

--

For fun, I asked GPT-4 how pre-training works. According to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.


Indeed, a good article. @rasbt I am surprised by the result too. I would also be curious what happens if we just apply weight normalization to standard LoRA itself.


That's a good question; I feel like this would have been a useful ablation study in the paper.


Awesome 🤩 I was wondering whether LoRA/DoRA can be applied to vision models or object detection models as well.

Pretrained models know how to detect and classify, but they don't know what to detect for out-of-distribution concepts.


Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.


Great explanation, thank you!


Thanks for your great work, learned a lot!

But I got quite confused about

self.m = nn.Parameter(
    self.linear.weight.norm(p=2, dim=0, keepdim=True))

This means that we need a trainable tensor whose shape is the same as the base weights. That may lose the advantage of LoRA variants, which only finetune a small number of parameters.

Looking at the DoRA source code, we found that they handle this step with the function below:

def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
    # calculate L2 norm of weight matrix, column-wise
    weight = transpose(weight, self.fan_in_fan_out)
    weight = weight + scaling * lora_weight
    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
    return weight_norm

x_eye = torch.eye(lora_A.weight.shape[1], device=lora_A.weight.device, dtype=x.dtype)
lora_weight = lora_B(lora_A(x_eye)).T

They seem to reuse the LoRA parameters by passing in the base weight, the LoRA weight, and the scaling.

Do you think this is better? It doesn't seem to follow the paper, in which the magnitude is also a trainable parameter.


Thanks for the comment. I think the code you copied is from my DoraMerged variant, which is where the adapters are merged with the Linear weights, which is why it looks like that. There are different ways you can implement LoRA and DoRA, i.e., the merged and the unmerged variant.
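On the parameter-count concern, a quick shape check may help. For an `nn.Linear` layer, `weight` has shape `(out_features, in_features)`, so the norm over `dim=0` yields a single row of `in_features` values rather than a tensor the size of the weight matrix. A minimal sketch with arbitrary layer sizes and rank:

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=1024, out_features=4096)

# Column-wise L2 norm: one magnitude per input dimension
m = nn.Parameter(linear.weight.norm(p=2, dim=0, keepdim=True))

rank = 16
lora_params = rank * (linear.in_features + linear.out_features)

print("weight shape:", tuple(linear.weight.shape))  # (4096, 1024)
print("m shape:", tuple(m.shape))                    # (1, 1024)
print("weight params:", linear.weight.numel())       # 4194304
print("LoRA params:", lora_params)                   # 81920
print("extra m params:", m.numel())                  # 1024
```

Under these assumed sizes, the magnitude vector adds 1,024 trainable values on top of the 81,920 LoRA parameters, which is small next to the roughly 4.2 million entries of the frozen weight.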


PS: The code was found in the peft library.


Seems like LeCun's server is down; I currently cannot download the MNIST dataset from there.


Arg, that's weird. In the meantime, I uploaded the local copy of MNIST here. You just need to download and place the "data" folder as is next to the notebook: https://drive.google.com/drive/folders/1QbZwHwyHqCMN7RpS5WWRNQSXK0UptKi_?usp=share_link


Thanks, that worked! Just looked briefly into the code while reading the article, looked good so far!

Btw, I found a minor issue: in the article, I think the link provided after the "DoRA is more robust to the rank hyperparameter than LoRA" visualization is not correct; it should point to the DoRA paper (https://arxiv.org/abs/2402.09353). I looked this up because I wanted to compare the avg. accuracy values in the visualization with the ones in the table on p. 19 of the DoRA paper :)


Thanks for the note, must have copied the wrong arxiv paper URL. Should be fixed now!


Can you please also add a requirements.txt to the repo? :)


Sure, just added. It's basically just torch and torchvision for the dataset.


Thanks! Yeah, it's also about the specific package version for reproducibility of the results, and package functionalities might be renamed or aggregated in the future.
