Once again, this has been an exciting month in AI research. This month, I'm covering two new openly available LLMs, insights into small finetuned LLMs, and a new parameter-efficient LLM finetuning technique. These two LLMs stand out for several reasons. One of them (OLMo) is completely open source, meaning that everything from the training code to the dataset to the log files is openly shared.
Hey, I was wondering if you would consider writing a blog post sometime about how you find your papers? Right now, my strategy is to look at the trending papers on the Papers With Code site, the Hugging Face blog, and your blog.
GeGLU vs SwiGLU could just be a “decoy” - a random change added just to make it harder to understand where the gains came from? Not sure, just an idea. There’s not a lot of science behind these hyperparameter choices unfortunately.
Thanks a lot for a great write-up (as always). I just wanted to ask whether you would be covering training big models using different sharding methodologies (distributed data parallelism, multi-GPU training, etc.) in your book?
Great read and keep up the amazing work!
Your round up is always incredible thanks for sharing!
Fantastic write-up, as usual, thank you!!
Maybe just one super-minor typo about OLMo:
"decay up to the peak learning rate" --> "decay up to a tenth of the peak learning rate"
Very impressive how many papers you manage to get through. Just ordered the book. Thanks
As always very well written. 😄
Is the motivation for moving from ReLU to SwiGLU and GeGLU to make the activation function smooth, instead of piecewise linear as with ReLU?
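To make the question above concrete, here is a small NumPy sketch of the three activations. ReLU has a kink at zero, while Swish (SiLU) and GELU are smooth everywhere; SwiGLU and GeGLU additionally gate one linear projection with another. The weight matrices and the tanh approximation of GELU are illustrative choices, not taken from any specific model's code.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # piecewise linear, kink at x = 0

def swish(x):
    return x / (1.0 + np.exp(-x))  # a.k.a. SiLU; smooth everywhere

def gelu(x):
    # tanh approximation of GELU; also smooth everywhere
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# GLU variants gate one projection with another:
#   SwiGLU(x) = Swish(x @ W) * (x @ V)
#   GeGLU(x)  = GELU(x @ W)  * (x @ V)
def swiglu(x, W, V):
    return swish(x @ W) * (x @ V)

def geglu(x, W, V):
    return gelu(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(swiglu(x, W, V).shape)  # (2, 8)
print(geglu(x, W, V).shape)   # (2, 8)
```

So the smoothness difference is real, though whether it is the source of the empirical gains is exactly the open question raised in the comments above.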
Congratulations on the book launch! I just put in a pre-order for a copy on Amazon.