Using Mixed-Precision and Fully Sharded Data Parallelism
Wow, Lightning does make it super easy to employ mixed precision and distributed training! When I wrote about mixed precision in 2020, right after PyTorch released its AMP (automatic mixed precision) module, it was a mess trying to autocast the layers and remember their precisions.
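For context, a minimal sketch of what AMP's autocasting does under the hood, using plain PyTorch (Lightning wraps this behind `Trainer` flags such as `precision="16-mixed"` and `strategy="fsdp"`). The CPU device and bfloat16 dtype here are just so the sketch runs without a GPU; on CUDA you would typically use `device_type="cuda"`:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 4)          # weights stay in float32
x = torch.randn(2, 8)            # inputs start in float32

# Inside the autocast region, matmul-heavy ops (like nn.Linear's forward)
# are automatically run in the lower-precision dtype.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)   # the output has been autocast to bfloat16
print(model.weight.dtype)  # the master weights remain float32
```

This is the casting bookkeeping that had to be done by hand before `torch.autocast` existed, and that Lightning now hides behind a single `Trainer` argument.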
Enjoyed the read! Thanks!
Clear and straight to the point, thanks a lot!
Any specific reasons to prefer bfloat16 to float16?
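One common reason is dynamic range: float16 has 5 exponent bits (max finite value 65504), while bfloat16 keeps float32's 8 exponent bits, trading mantissa precision for overflow safety. A quick sketch with NumPy (which ships float16 but not bfloat16, so the bfloat16 side is described in comments):

```python
import numpy as np

# float16: 5 exponent bits -> largest finite value is 65504,
# so moderately large activations or gradients overflow to inf.
print(np.finfo(np.float16).max)   # 65504.0
print(np.float16(70000.0))        # inf

# bfloat16 keeps float32's 8 exponent bits, so the same value
# (and anything up to ~3.4e38) stays finite; the cost is fewer
# mantissa bits (8 vs float16's 11), i.e. coarser precision.
# This is why bfloat16 training usually needs no loss scaling,
# while float16 training typically does.
```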