Low-rank adaptation (LoRA) is a machine learning technique that modifies a pretrained model (for example, an LLM or vision transformer) to better suit a specific, often smaller, dataset by adjusting only a small, low-rank subset of the model's parameters.
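For concreteness, the core idea can be sketched in a few lines of PyTorch (a minimal sketch, not necessarily the exact code used later in the article: the low-rank matrices A and B are the only trainable parameters, and alpha scales their contribution):

```
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        # A and B are the only trainable tensors; their product is a
        # rank-constrained update added on top of the frozen pretrained weight
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # scaled low-rank update: alpha * x @ (A @ B)
        return self.alpha * (x @ self.A @ self.B)
```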
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
In the LoRA implementation, you are applying the layers to the input and then summing up the results. Wouldn't it be more efficient to sum up the weights of both layers first then multiply the input by result?
It should be valid since in matrix multiplication: x(A + B) is the same as xA + xB where x is a constant and (A and B) are matrices.
Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
Does it seem a bit slow that m is multiplied by both V and delta V? (I guess especially multiplying by V adds steps for the forward propagation). I don't really see a way around this though.
Thanks! The weight norm definitely adds an extra step but that should not be a bottleneck in practice imho. The multiplication should only happen once, too, per update.
For domains where we have large amounts of raw data (eg: 10 Billion tokens or more), would peft methods like DORA/LORA combined with converting the data to instruction format (eg: AdaptLLM) be sufficient to adapt the model to the new domain or do we have to definitely perform Full Fine-Tuning?
Good question. You can try to use DoRA/LoRA for that but I would do that in a pretraining setting. Note that in instruction-finetuning, you only have 1 model update per input/target text. The target is the input text shifted by 1 like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.
I have never worked directly on large scale pre-training, but how does that work on large texts? Do they perform 1 step (predict next token) for each token in the document as a starting point? So, each document would create number of steps = number of tokens in the document (minus the context size)? Or do people use some form of sampling?
I've only seen nanogpt's training loop : https://github.com/karpathy/nanoGPT/blob/master/train.py#L120 where each batch basically randomly samples the starting points from one large document, but I'm guessing this is not how large industry models are trained.
I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.
I think it depends on how things are implemented as large datasets may be distributed across different machines. In my book, I am scanning over the document using a PyTorch DataLoader like you described (note that I use shuffle=True option though to randomize access to prevent overfitting).
NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it aims to minimize use of other tools. From a quick look, it looks like it's doing sampling with replacement so you may get some samples multiple times and others not at all. Both approaches would work though in practice as it's more about seeing large amounts of data.
Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py
Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.
I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.
But the repository was pretty useful to learn some other aspects of the pretraining process.
--
For fun, I played with asking GPT4 about how pre-training works, and according to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.
Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.
this means that we need set a trainable tensor which shape is same to the basicWeights. This may loss the advantage of LoRA's variant which only finetuning the less parameters.
By looking for the DoRA source code. We found they manage this step by using the function below:
Thanks for the comment. I think the code you copied is from my DoraMerged variant, which is where the adapters are merged with the Linear weights, which is why it looks like that. There are different ways you can implement LoRA and DoRA, i.e., the merged and the unmerged variant.
Feb 18·edited Feb 18Liked by Sebastian Raschka, PhD
Thanks, that worked! Just looked briefly into the code while reading the article, looked good so far!
Btw I have found a minor issue: In the article, I think the provided link after "DoRA is more robust to the rank hyperparameter than LoRA" visualization is not correct, should be the DoRA paper (https://arxiv.org/abs/2402.09353). I have looked this up bc I wanted to compare the avg. accuracy values in the visualization with the ones in the table on p. 19 in the DoRA paper :)
Thanks! Yeah, it's also about the specific package version for reproducibility of the results, and package functionalities might be renamed or aggregated in the future.
Fantastic write-up, thank you!
Small correction:
"The DoRA two-step process (decomposing a pretrained weight matrix and applying DoRA to the directional matrix) is further illustrated in the figure from the DoRA paper below."
applying DoRA to the directional matrix --> applying LoRA to the directional matrix.
Good catch, thanks!
layer_lora_1 = LinearWithLoRA(layer, rank=2, alpha=4)
print("LoRA output:", layer_lora_2(x))
Minor typo, did you mean `print("LoRA output:", layer_lora_1(x))`?
Thanks for the great write-up.
Good catch, updated it!
Yes, it seems to be outdated in the article (in the code in the repo, it is correct)
Great write-up! Clear, informative, on a super useful topic. Thank you for sharing!!
Thanks for the kind words!
Awesome article!
Small question:
In the LoRA implementation, you are applying the layers to the input and then summing up the results. Wouldn't it be more efficient to sum up the weights of both layers first and then multiply the input by the result?
It should be valid, since in matrix multiplication x(A + B) is the same as xA + xB, where x is the input and A and B are matrices.
Hey there,
You raise a good question; there are multiple ways to implement it. In the article, I had:
```
class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)
```
and
```
class LinearWithLoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        lora = self.lora.A @ self.lora.B  # combine LoRA matrices
        # then combine LoRA with orig. weights
        combined_weight = self.linear.weight + self.lora.alpha * lora.T
        return F.linear(x, combined_weight, self.linear.bias)
```
Is the second case what you had in mind? I.e., `self.linear.weight + self.lora.alpha * lora.T`?
My bad! Not sure how I missed the second variation. So in this case, should we favor the second variation for fewer matrix multiplications?
I have an OOM error on this line: `denominator = numerator.norm(p=2, dim=0, keepdim=True)`. Will it consume much more GPU memory? How can we handle this?
Hm not sure how to compute this more cheaply. What model were you running this on?
I use the T0-3B model. LoRA works well on a 40 GB or 80 GB machine, but DoRA has OOM issues on 40 GB/80 GB with rank=16. I have to reduce the LoRA rank.
Oh I see. I am not sure if there is a cheaper way to compute the norm. I guess we will have to wait until the authors finally share the official implementation.
Thanks for your great writing.
Great write-up. Thank you!
Great write-up! thank you!
Glad to hear! Thanks!
as always, great post!
Thank you very much!
Glad to hear!
Incredible write-up and turn around on this.
Does it seem a bit slow that m is multiplied by both V and ΔV? (I guess multiplying by V especially adds steps to the forward pass.) I don't really see a way around this, though.
Thanks! The weight norm definitely adds an extra step, but that should not be a bottleneck in practice, imho. The multiplication also only happens once per update.
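For readers following this thread, here is a rough sketch of the merged DoRA variant, pieced together from the snippets quoted elsewhere in these comments (the class and attribute names are assumptions based on the article's code, so treat it as illustrative rather than the exact implementation):

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearWithDoRAMerged(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )
        # trainable magnitude vector m, initialized from the column norms of W
        self.m = nn.Parameter(self.linear.weight.norm(p=2, dim=0, keepdim=True))

    def forward(self, x):
        lora = self.lora.A @ self.lora.B
        numerator = self.linear.weight + self.lora.alpha * lora.T  # W + ΔW
        denominator = numerator.norm(p=2, dim=0, keepdim=True)
        directional_component = numerator / denominator  # normalized direction
        new_weight = self.m * directional_component      # rescaled by m once per forward pass
        return F.linear(x, new_weight, self.linear.bias)
```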
For domains where we have large amounts of raw data (e.g., 10 billion tokens or more), would PEFT methods like DoRA/LoRA, combined with converting the data to instruction format (e.g., AdaptLLM), be sufficient to adapt the model to the new domain, or do we definitely have to perform full fine-tuning?
Good question. You can try to use DoRA/LoRA for that, but I would do it in a pretraining setting. Note that in instruction-finetuning, you only have one model update per input/target text. The target is the input text shifted by 1, like in pretraining, but you don't train on this text iteratively, which is why a model usually doesn't soak up as much knowledge during instruction-finetuning.
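To illustrate what "the target is the input text shifted by 1" means, here is a toy sketch with made-up token IDs:

```
import torch

token_ids = torch.tensor([21, 64, 3, 88, 9, 42])  # made-up token IDs
inputs = token_ids[:-1]   # model sees:     [21, 64, 3, 88, 9]
targets = token_ids[1:]   # model predicts: [64, 3, 88, 9, 42]
```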
Interesting.
I have never worked directly on large-scale pre-training, but how does that work on long texts? Do they perform one step (predict the next token) for each token in the document as a starting point? So each document would create a number of steps equal to the number of tokens in the document (minus the context size)? Or do people use some form of sampling?
I've only seen nanoGPT's training loop (https://github.com/karpathy/nanoGPT/blob/master/train.py#L120), where each batch basically samples starting points at random from one large document, but I'm guessing this is not how large industry models are trained.
I feel that there are a lot of articles on fine-tuning but I haven't seen many that go into the finer details of pre-training. I understand it's a bit hard for regular consumers to try out, but such code would still be useful as an educational tool.
I think it depends on how things are implemented, as large datasets may be distributed across different machines. In my book, I am scanning over the document using a PyTorch DataLoader like you described (note that I use the shuffle=True option, though, to randomize access and prevent overfitting).
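A simplified sketch of that kind of sliding-window setup (the class and parameter names here are made up for illustration, not the book's actual code):

```
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    # Illustrative only: chunk one long token sequence into fixed-length
    # input/target pairs for next-token prediction
    def __init__(self, token_ids, context_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(1000))  # stand-in for a tokenized document
dataset = SlidingWindowDataset(token_ids, context_length=32, stride=32)
# shuffle=True randomizes the order in which the chunks are visited each epoch
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```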
NanoGPT might do the random sampling to avoid using the DataLoader -- as I understand it, the project aims to minimize the use of other tools. From a quick look, it seems to be sampling with replacement, so you may get some samples multiple times and others not at all. Both approaches work in practice, though, as it's more about seeing large amounts of data.
Oh, and regarding how others train LLMs, that's a good point. Most people don't share these details and only release the inference code. But one project that comes to mind is the recent OLMo. I haven't had the chance to look into the details of their pretraining code but you can find and experiment with it here: https://github.com/allenai/OLMo/blob/main/scripts/train.py
Thanks for taking the time to respond!
Spent some time reading through the OLMo repo, but I couldn't find any specific code in the dataset that shows repeated batching.
I could see the prepare_memmap_dataset file which loads one jsonl line (probably a document) into the dataset as one input point, and I think the Trainer only sees it once, but I'm not sure.
But the repository was pretty useful to learn some other aspects of the pretraining process.
--
For fun, I tried asking GPT-4 how pre-training works, and according to it, people sample different starting points and ensure there's overlap so that the model can learn across long documents, but I don't know if that's reliable.
Indeed, a good article. @rasbt, I am surprised by the result too. I would also be curious what happens if we just apply weight normalization to standard LoRA itself.
That's a good question; I feel like this would have been a useful ablation study in the paper
Awesome 🤩 I was wondering, can LoRA/DoRA be applied to vision models or object detection models as well?
Pretrained models know how to detect/classify, but they don't know what to detect (out-of-distribution concepts).
Yes, similar to LoRA it can also be applied to vision transformers, diffusion models, etc. Actually, the DoRA paper included experiments with vision models as well.
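As a rough sketch of how that could look in practice, a hypothetical helper can recursively swap the nn.Linear layers of a frozen pretrained vision model for the LinearWithLoRA wrapper discussed above (the helper name and settings are made up for illustration):

```
import torch.nn as nn

def replace_linear_with_lora(model, rank, alpha):
    # Hypothetical helper: walk the module tree and wrap every nn.Linear
    # (assumes the LinearWithLoRA class shown earlier in this thread)
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Linear):
            setattr(model, name, LinearWithLoRA(module, rank=rank, alpha=alpha))
        else:
            replace_linear_with_lora(module, rank, alpha)

# Usage sketch: freeze the pretrained weights first, then add the adapters
# for param in vision_model.parameters():
#     param.requires_grad = False
# replace_linear_with_lora(vision_model, rank=8, alpha=16)
```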
Great explanation, thank you!
Thanks for your great work, learned a lot!
But I got quite confused about:
self.m = nn.Parameter(
    self.linear.weight.norm(p=2, dim=0, keepdim=True))
This means that we need to set a trainable tensor with the same shape as the base weights, which may lose the advantage of LoRA variants that fine-tune only a small number of parameters.
Looking at the DoRA source code, we found that they handle this step using the function below:
def get_weight_norm(self, weight, lora_weight, scaling) -> torch.Tensor:
    # calculate L2 norm of weight matrix, column-wise
    weight = transpose(weight, self.fan_in_fan_out)
    weight = weight + scaling * lora_weight
    weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
    return weight_norm

x_eye = torch.eye(lora_A.weight.shape[1], device=lora_A.weight.device, dtype=x.dtype)
lora_weight = lora_B(lora_A(x_eye)).T
They seem to reuse the LoRA parameters by passing in the base weight, the LoRA weight, and the scaling.
Do you think this is better? It seems it doesn't follow the paper, where the magnitude is also a trainable parameter.
Thanks for the comment. I think the code you copied is from my DoraMerged variant, which is where the adapters are merged with the Linear weights, which is why it looks like that. There are different ways you can implement LoRA and DoRA, i.e., the merged and the unmerged variant.
PS: The code above was found inside the PEFT library.
Seems like LeCun's server is down; currently I cannot download the MNIST dataset from there.
Arg, that's weird. In the meantime, I uploaded the local copy of MNIST here. You just need to download and place the "data" folder as is next to the notebook: https://drive.google.com/drive/folders/1QbZwHwyHqCMN7RpS5WWRNQSXK0UptKi_?usp=share_link
Thanks, that worked! Just looked briefly into the code while reading the article, looked good so far!
Btw, I have found a minor issue: in the article, I think the link provided after the "DoRA is more robust to the rank hyperparameter than LoRA" visualization is not correct; it should be the DoRA paper (https://arxiv.org/abs/2402.09353). I looked this up because I wanted to compare the avg. accuracy values in the visualization with the ones in the table on p. 19 of the DoRA paper :)
Thanks for the note, must have copied the wrong arxiv paper URL. Should be fixed now!
Can you please also add a requirements.txt to the repo? :)
Sure, just added. It's basically just torch and torchvision for the dataset.
Thanks! Yeah, it's also about the specific package version for reproducibility of the results, and package functionalities might be renamed or aggregated in the future.