12 Comments

Congrats

Wow, nice summary!

(fyi the link to the "Label Supervised LLaMA Finetuning" paper has an empty space)

Thanks! (And not anymore :D)

Highly recommended. I needed to max out the DRAM on my laptop from 16 GB to 64 GB, fwiw.

I was able to run the code in the whole book locally with 32 GB of RAM on Windows, even the GPT-2 XL model (but this was right at my hardware limit). When running it inside the Docker container, I needed an additional 8-12 GB of RAM.

That's good to know, thanks for sharing!

Oh interesting. I was testing all the code on my MacBook Air (24 GB RAM), where it worked fine tbh. Maybe it depends on the OS.

Got the book yesterday... :^D

But what hardware is required, or what do you suggest?

Nice timing! I hope you have a fun weekend ahead! The code in the main chapters of this book should run on conventional laptops within a reasonable timeframe. Additionally, chapters 5 to 7 automatically run on a GPU if one is available.
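
For context, "automatically run on a GPU if available" usually refers to the standard PyTorch device-selection pattern. Here is a minimal sketch of that pattern as an assumption about the general approach, not the book's exact code:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model and each data batch would then be moved to this device, e.g.:
# model.to(device)
# input_batch = input_batch.to(device)
print(f"Running on: {device}")
```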

Thank you for the article!!

About the padding vs. not-padding: Can we keep the padding but, instead of always taking the output from the last token, take the output from the last token of each sequence in the batch? Wondering if you tried that and if it helps. Thanks.

Hi there! When you say "we take the output from the last token of the sequence," do you mean the last token that is not a padding token? Yes, this would be similar to using a batch size of 1 with gradient accumulation like in the code, i.e., the results would be mathematically the same. However, it would require restructuring the code somewhat, which is why I opted for the simpler solution here that gives the same results.

Oh yes, I meant the same but should have clarified better (I was thinking of the input text sequence in my head). And yes, I thought it would be equivalent but would allow for a higher batch size for GPU efficiency. It would need the code to change, hopefully only at the output layer to choose the appropriate index instead of the last one. It should be easier than, say, a custom mask to ignore padding, so I was wondering if there's a catch. For the article itself, keeping things simple with a batch size of 1 makes sense. Thank you for responding and for the awesome articles too; I am going through more of them.
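
For readers following this exchange, here is a minimal sketch of the idea being discussed: keep the padded batch, but gather the model output at the last non-padding position of each sequence instead of always taking the final position. This is a hypothetical PyTorch illustration; the pad token ID and tensor shapes are assumptions, not the article's code:

```python
import torch

pad_token_id = 50256  # assumption: padding with GPT-2's <|endoftext|> token ID

# Right-padded batch of token IDs, shape (batch_size, seq_len)
input_ids = torch.tensor([
    [11, 22, 33, pad_token_id, pad_token_id],
    [44, 55, 66, 77, 88],
])

# Model outputs, shape (batch_size, seq_len, num_classes); random stand-in here
logits = torch.randn(input_ids.shape[0], input_ids.shape[1], 2)

# Index of the last non-padding token in each sequence
last_token_idx = (input_ids != pad_token_id).sum(dim=1) - 1

# Gather the logits at those positions instead of always using position -1
selected = logits[torch.arange(input_ids.shape[0]), last_token_idx]
print(selected.shape)  # torch.Size([2, 2])
```

These per-sequence logits could then be passed to the loss as usual, which should match the batch-size-1-with-gradient-accumulation result described above.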
