Congrats
Wow, nice summary!
(fyi, the link to the "Label Supervised LLaMA Finetuning" paper has a stray space in it)
Thanks! (And not anymore :D)
Highly recommended. I needed to max out the RAM on my laptop from 16 GB to 64 GB, fwiw.
I was able to run the code for the whole book locally with 32 GB of RAM on Windows, even the GPT-2 XL model (though that was right at my hardware limit). When running it inside a Docker container, I needed an additional 8-12 GB of RAM.
That's good to know, thanks for sharing!
Oh, interesting. I was testing all the code on my MacBook Air (24 GB RAM), where it worked fine tbh. Maybe it depends on the OS.
Got the book yesterday... :^D
But what hardware is required/do you suggest?
Nice timing! I hope you have a fun weekend ahead! The code in the main chapters of this book should run on conventional laptops within a reasonable timeframe. Additionally, chapters 5 to 7 automatically run on a GPU if one is available.
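In case it helps, the usual PyTorch device-selection pattern looks roughly like this (a minimal sketch with a placeholder module, not the book's exact code):

```python
import torch
import torch.nn as nn

# Pick a CUDA GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)    # placeholder for the book's GPT model
batch = torch.randn(8, 4).to(device)  # inputs must live on the same device as the model
print(model(batch).shape)             # torch.Size([8, 2])
```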
Thank you for the article!!
About padding vs. not padding: can we keep the padding but, instead of always taking the output from the last token position, take the output from the last token of the sequence for each sequence in the batch? Wondering if you tried that and if it helps. Thanks.
Hi there! When you say "we take the output from the last token of the sequence," do you mean the last token that is not a padding token? Yes, this would be similar to using a batch size of 1 with gradient accumulation like in the code, i.e., the results would be mathematically the same. However, it would require restructuring the code somewhat, which is why I opted for the simpler solution here that gives the same results.
Oh yes, I meant the same but should have clarified better (I was thinking of the input text sequence in my head). And yes, I thought it would be equivalent but would allow for a higher batch size for GPU efficiency. It would need the code to change, hopefully only at the output layer, to choose the appropriate index instead of the last one. It should be easier than, say, a custom mask to ignore padding, so I was wondering if there's a catch. For the article itself, keeping things simple with a batch size of 1 makes sense. Thank you for responding and for the awesome articles too; I am going through more of them.
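For anyone who wants to try this, here is a minimal sketch of selecting the logits at the last non-padding token of each sequence (the pad token ID, tensor shapes, and function name below are illustrative assumptions, not code from the book or article):

```python
import torch

def last_nonpad_logits(logits, input_ids, pad_token_id):
    """Pick the logits at the last non-padding position of each sequence.

    logits:    (batch_size, seq_len, num_classes)
    input_ids: (batch_size, seq_len), right-padded with pad_token_id
    Assumes pad_token_id does not occur inside the real text.
    """
    seq_lengths = (input_ids != pad_token_id).sum(dim=1)  # real tokens per row
    last_idx = (seq_lengths - 1).clamp(min=0)             # guard against all-padding rows
    batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
    return logits[batch_idx, last_idx]                    # (batch_size, num_classes)

# Dummy example: 2 sequences right-padded with token ID 50256, 2 output classes
input_ids = torch.tensor([[11, 22, 33, 50256, 50256],
                          [44, 55, 50256, 50256, 50256]])
logits = torch.randn(2, 5, 2)
print(last_nonpad_logits(logits, input_ids, pad_token_id=50256).shape)  # torch.Size([2, 2])
```

Since padding is appended at the end and the attention mask is causal, the last real token never attends to the padding tokens, which is why this gives the same result as the batch-size-1 approach discussed above.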