13 Comments
Mar 4 · Liked by Sebastian Raschka, PhD

Hey, I was wondering if you would consider writing a blog post sometime about how you find your papers? Right now my strategy is to look at the trending papers on the Papers with Code site, the Hugging Face blog, and your blog.

Mar 3 · Liked by Sebastian Raschka, PhD

GeGLU vs SwiGLU could just be a “decoy” - a random change added just to make it harder to understand where the gains came from? Not sure, just an idea. There’s not a lot of science behind these hyperparameter choices unfortunately.

Great read and keep up the amazing work!
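
As a side note, here is a minimal sketch (PyTorch-style, with hypothetical dimensions, not the exact layer from any of the papers) of how small the GeGLU/SwiGLU difference actually is; the two variants differ only in the gate's activation function:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """Gated feed-forward block; gate_act is the only difference between
    a GeGLU variant (GELU gate) and a SwiGLU variant (SiLU/Swish gate)."""
    def __init__(self, d_model, d_hidden, gate_act):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        self.gate_act = gate_act

    def forward(self, x):
        # element-wise gate: act(W_gate x) * (W_up x), projected back to d_model
        return self.w_down(self.gate_act(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16)                    # toy batch of embeddings
geglu = GatedFFN(16, 64, F.gelu)          # GeGLU: GELU-activated gate
swiglu = GatedFFN(16, 64, F.silu)         # SwiGLU: SiLU (Swish)-activated gate
print(geglu(x).shape, swiglu(x).shape)    # both: torch.Size([2, 16])
```

Everything else (the matrices, the dimensions) stays identical, which is part of why it is hard to attribute gains to this choice in isolation.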

Mar 3 · Liked by Sebastian Raschka, PhD

Your round-up is always incredible, thanks for sharing!

Mar 3 · Liked by Sebastian Raschka, PhD

Fantastic write-up, as usual, thank you!!

Maybe just one super-minor typo about OLMo:

"decay up to the peak learning rate" --> "decay up to a tenth of the peak learning rate"
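
For anyone skimming, a minimal sketch of what that corrected phrasing means in practice (assuming linear warmup followed by linear decay; the numbers and the exact schedule shape are illustrative, not OLMo's actual settings):

```python
def lr_at_step(step, peak_lr=4e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to peak_lr, then linear decay down to 0.1 * peak_lr."""
    min_lr = 0.1 * peak_lr  # decay ends at a tenth of the peak learning rate
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr - (peak_lr - min_lr) * min(progress, 1.0)

print(lr_at_step(0), lr_at_step(2_000), lr_at_step(100_000))
# 0.0  0.0004  ~4e-05  -> the final value is a tenth of the peak
```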

Mar 3 · Liked by Sebastian Raschka, PhD

Very impressive how many papers you manage to get through. Just ordered the book. Thanks


Thanks a lot for a great write-up (as always). I just wanted to ask whether you would be covering training big models using different sharding methodologies (distributed data parallelism, multi-GPU training, etc.) in your book?


As always, very well written. 😄

Is the move from ReLU to SwiGLU and GeGLU meant to make the activation function smoother, rather than piecewise linear as with ReLU?
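
To make the smoothness point concrete, here is a small comparison (using PyTorch's built-in activations; the input points are just illustrative) of ReLU's kink at zero versus the smooth SiLU (Swish) and GELU curves that SwiGLU and GeGLU use for their gates:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-2, 2, 9)  # sample points around the origin
for name, act in [("relu", F.relu), ("silu", F.silu), ("gelu", F.gelu)]:
    print(f"{name:>4}:", [round(v, 3) for v in act(x).tolist()])

# ReLU is exactly 0 for x <= 0 and x for x > 0 (piecewise linear, kink at 0),
# whereas SiLU(x) = x * sigmoid(x) and GELU(x) = x * Phi(x) are smooth and
# dip slightly below 0 near the origin, which changes gradient behavior.
```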


Congratulations on the book launch! I just put in a pre-order for a copy on Amazon.
