Wow, Lightning really does make it super easy to use mixed precision and distributed training! When I wrote about mixed precision in 2020, when PyTorch had just released its AMP (automatic mixed precision) module, it was a mess trying to autocast the layers and keep track of their precisions.
Enjoyed the read! Thanks!
Glad to hear this is useful! And yeah, internally it's using PyTorch's AMP, but it's making the API more user-friendly :)
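For context, here is a minimal sketch of what that looks like, not the exact code from the post: with plain PyTorch you combine torch.autocast with a GradScaler yourself, while Fabric folds both behind a single precision argument. The toy model, optimizer, and data below are made up for illustration, and the "16-mixed" string assumes a recent Lightning 2.x release and a CUDA GPU.

```python
import torch
from lightning.fabric import Fabric

# Toy setup just so the snippet is self-contained (not from the original post).
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(64, 32), torch.randint(0, 2, (64,))

# Plain PyTorch AMP would look roughly like this:
#   scaler = torch.cuda.amp.GradScaler()
#   with torch.autocast(device_type="cuda", dtype=torch.float16):
#       loss = torch.nn.functional.cross_entropy(model(inputs), targets)
#   scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()

# Fabric wraps the same machinery behind one constructor argument:
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")  # or "bf16-mixed"
model, optimizer = fabric.setup(model, optimizer)
inputs, targets = fabric.to_device((inputs, targets))

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
fabric.backward(loss)   # replaces loss.backward(); applies loss scaling under "16-mixed"
optimizer.step()
```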
Clear and straight to the point, thanks a lot!
Any specific reasons to prefer bfloat16 to float16?
Good question! In many situations, I find that both of them work well. However, when I finetuned LLaMA models, for example, float16 gave really poor performance. I think that might have something to do with poorly normalized activations or gradients, since bfloat16 can represent a much larger range of values (at lower precision). In other words, there may have been values that exceeded float16's range of roughly -65,504 to 65,504, and that would then cause problems in float16 but not in bfloat16.
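To make the range point concrete, here is a small illustration in plain PyTorch (a standalone sketch, not code from the post): float16 tops out around 65,504, so anything larger overflows to inf, while bfloat16 keeps roughly the same range as float32 at the cost of precision.

```python
import torch

# Representable ranges: bfloat16 trades mantissa bits for float32's exponent range.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, same order of magnitude as float32

x = torch.tensor([70_000.0])             # fits in float32/bfloat16, but not in float16
print(x.to(torch.float16))               # tensor([inf], dtype=torch.float16) -> overflow
print(x.to(torch.bfloat16))              # tensor([70144.], dtype=torch.bfloat16) -> coarse, but finite
```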