Wow, Lightning really does make it super easy to use mixed precision and distributed training! When I wrote about mixed precision in 2020, when PyTorch had just released its AMP (automatic mixed precision) module, it was a mess trying to autocast the layers and keep track of their precisions.
Enjoyed the read! Thanks!
Glad to hear this is useful! And yeah, internally it's using PyTorch's AMP, but it's making the API more user-friendly :)
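For context, here is a minimal sketch of what that looks like, not the exact code from the post: with plain PyTorch you combine torch.autocast with a GradScaler yourself, while Fabric folds both behind a single precision argument. The toy model, optimizer, and data below are made up for illustration, and the "16-mixed" string assumes a recent Lightning 2.x release and a CUDA GPU.

```python
import torch
from lightning.fabric import Fabric

# Toy setup just so the snippet is self-contained (not from the original post).
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(64, 32), torch.randint(0, 2, (64,))

# Plain PyTorch AMP would look roughly like this:
#   scaler = torch.cuda.amp.GradScaler()
#   with torch.autocast(device_type="cuda", dtype=torch.float16):
#       loss = torch.nn.functional.cross_entropy(model(inputs), targets)
#   scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()

# Fabric wraps the same machinery behind one constructor argument:
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")  # or "bf16-mixed"
model, optimizer = fabric.setup(model, optimizer)
inputs, targets = fabric.to_device((inputs, targets))

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
fabric.backward(loss)   # replaces loss.backward(); applies loss scaling under "16-mixed"
optimizer.step()
```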
Clear and straight to the point, thanks a lot!
Any specific reasons to prefer bfloat16 to float16?
Good question! In many situations, I find that both of them work well. However, when I finetuned LLaMA models, for example, float16 gave really poor performance. I think that might have something to do with poorly normalized activations or gradients, since bfloat16 can represent a much larger range of values (at lower precision). In other words, there may have been values that exceeded float16's range of roughly -65,504 to 65,504, and that would then cause problems in float16 but not in bfloat16.
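To make the range point concrete, here is a small illustration in plain PyTorch (a standalone sketch, not code from the post): float16 tops out around 65,504, so anything larger overflows to inf, while bfloat16 keeps roughly the same range as float32 at the cost of precision.

```python
import torch

# Representable ranges: bfloat16 trades mantissa bits for float32's exponent range.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, same order of magnitude as float32

x = torch.tensor([70_000.0])             # fits in float32/bfloat16, but not in float16
print(x.to(torch.float16))               # tensor([inf], dtype=torch.float16) -> overflow
print(x.to(torch.bfloat16))              # tensor([70144.], dtype=torch.bfloat16) -> coarse, but finite
```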