Once again, this has been an exciting month in AI research. This month, I'm covering two new openly available LLMs, insights into small finetuned LLMs, and a new parameter-efficient LLM finetuning technique. The two LLMs mentioned above stand out for several reasons. One LLM (OLMo) is completely open source, meaning that everything from the training code to the dataset to the log files is openly shared.
Hey, I was wondering if you would consider writing a blog post sometime about how you find your papers? Right now my strategy is to look at the trending papers on the Papers with Code site, the Hugging Face blog, and your blog.
Thanks for the suggestion. I can put it on my list of interesting topics to write about one day. I was formerly a moderator for the machine learning category (cs.LG) on arXiv, so it's an old habit of mine to scan the submissions, which is what I often (but not daily) do to find interesting papers.
Thanks so much for the reply! I have read every single one of your blog posts and read your book: Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2. I really appreciate all of the work you put into this field and I want to thank you for it!
That's a lot of articles! Thanks so much for the kind words. Knowing that these materials are so well received keeps me motivated to write more :)
GeGLU vs SwiGLU could just be a “decoy” - a random change added just to make it harder to understand where the gains came from? Not sure, just an idea. There’s not a lot of science behind these hyperparameter choices unfortunately.
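To make the comparison concrete, here is a minimal sketch of a gated feed-forward block in which the only difference between the two variants is the gate's nonlinearity (this assumes PyTorch; the class name and layer sizes are illustrative and not taken from any particular model):

```python
# A minimal sketch (assumed names and sizes, not code from any of the models
# discussed here) showing that SwiGLU and GeGLU differ only in the gate's
# nonlinearity: SiLU (swish) vs. GELU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=1365, variant="swiglu"):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        self.act = F.silu if variant == "swiglu" else F.gelu  # "geglu"

    def forward(self, x):
        # Gate the linear "up" projection with the activated "gate" projection
        return self.w_down(self.act(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
print(GatedFFN(variant="swiglu")(x).shape, GatedFFN(variant="geglu")(x).shape)
```

Since the two variants differ only in this single nonlinearity, attributing a performance gain to the swap alone would require a dedicated ablation, which echoes the point above.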
Great read and keep up the amazing work!
Your round up is always incredible thanks for sharing!
Fantastic write-up, as usual, thank you!!
Maybe just one super-minor typo about OLMo:
"decay up to the peak learning rate" --> "decay up to a tenth of the peak learning rate"
Thanks on both counts. And yes, this was a typo! Just fixed it!
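For readers curious what the corrected sentence describes, here is an illustrative warmup-then-linear-decay schedule that ends at one tenth of the peak learning rate (plain Python; the peak value and step counts are made-up placeholders, not the actual OLMo hyperparameters):

```python
# Illustrative only: linear warmup to a peak learning rate, then linear decay
# down to one tenth of that peak. The numbers below are placeholders.
def lr_at_step(step, peak_lr=3e-4, warmup_steps=2_000, total_steps=100_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup to peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - 0.9 * progress)               # decay to 0.1 * peak

print(lr_at_step(0), lr_at_step(2_000), lr_at_step(100_000))
# 0.0  0.0003  ~3e-05 (i.e., one tenth of the peak)
```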
Very impressive how many papers you manage to get through. Just ordered the book. Thanks
Ah, yes, it's a lot of work, but when spread out over the month, the amount is actually not that scary. There are also many papers that I only briefly skim, because reading each paper in detail would indeed be a full-time job. Thanks for getting a copy of my book!
Thanks a lot for a great write-up (as always). I just wanted to ask whether you would be covering training big models using different sharding methodologies (distributed data parallelism, multi-GPU training, etc.) in your book?
As always very well written. 😄
Is the move from ReLU to SwiGLU and GeGLU meant to make the activation function smooth instead of piecewise linear, as it is with ReLU?
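For what it's worth, the smoothness difference the question refers to is easy to see numerically: ReLU is piecewise linear with a kink at zero, while GELU and SiLU (the nonlinearities inside GeGLU and SwiGLU) are smooth and let small negative values through (a quick check, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)  # [-3, -2, -1, 0, 1, 2, 3]
print(F.relu(x))  # exactly 0 for x <= 0, identity for x > 0 (piecewise linear)
print(F.gelu(x))  # smooth; slightly negative outputs for small negative inputs
print(F.silu(x))  # smooth "swish" curve, x * sigmoid(x)
```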
Congratulations on the book launch! I just put in a pre-order for a copy on Amazon.