Thanks Seb. I have always wanted to read summaries like these, as one can't keep up with everything, especially when one isn't actively working in this field. Your summaries help me get a bird's-eye view of all the major developments and stay relevant, and for free. Many thanks 🙏
Happy New Year! We hope you recover quickly and stay in great health! I'm Mengge Zheng, an editor from Turing Publishing, part of the People's Posts and Telecommunications Press. The Chinese edition of your book "Build a Large Language Model (From Scratch)" will be hitting the shelves soon! Additionally, we’ve translated this article into Chinese to share with readers in China. It's truly fantastic content!
Happy New Year! If I remember correctly, the publisher mentioned that there were translations in the works, but I didn't know it would happen so quickly. That's great news!
Loved it Sir. Your articles are gems. They deliver signals in an ocean of noise. You skillfully combine theory and practice.
Thanks for the kind words!
Thank you, Sebastian. Build a Large Language Model (From Scratch) helped me a lot, especially in understanding all the research papers I was reading and in making logical sense of the concepts.
I think that coding these things from scratch really is an effective way to learn, and I am glad to hear that my book was helpful!
Thanks Sebastian for writing another great summary of the literature! In section 3 (continual pretraining), you mentioned that continual pretraining is a good way to extend an existing LLM with new knowledge; and in section 5 (LoRA), you also mentioned that full finetuning is good at adapting an existing LLM to a new domain. It sounds like both continual pretraining and full finetuning are capable of augmenting an LLM’s knowledge base.
As someone who has done neither continual pretraining nor full finetuning due to resource constraints, I’m wondering: if, for instance, I had a pretrained LLM and wanted to make it capable of chatting about medical topics (or any distinct domain), and I had all the GPUs, would it be better to perform continual pretraining or full finetuning? Or should I do both?
Good question. The training algorithm in continued pretraining and full finetuning is the same (in my Build a Large Language Model (From Scratch) book, I designed the code so that you can reuse the training function from the pretraining chapter in the later instruction finetuning chapter to show exactly this). And to your question, I would do both: continued pretraining to take up the new knowledge, and then finetuning on Q&A data to preserve the "chatting" capabilities. (Another option would be to use another model to reformat the pretraining data as Q&A data and only do full finetuning; as long as you don't have so much data that reformatting it becomes infeasible, this is also a good option.)
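To give a rough idea of what I mean by "the same training algorithm," here is a minimal sketch (illustrative names, not the actual code from the book): the next-token cross-entropy loop stays identical, and only the data you feed into it changes.

```python
import torch
import torch.nn.functional as F

def train(model, loader, optimizer, device, num_epochs):
    # Same loop for continued pretraining and instruction finetuning;
    # only the contents of `loader` differ.
    model.train()
    for epoch in range(num_epochs):
        for input_ids, target_ids in loader:  # targets = inputs shifted by one token
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)
            optimizer.zero_grad()
            logits = model(input_ids)  # (batch, seq_len, vocab_size)
            loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
            loss.backward()
            optimizer.step()

# Continued pretraining: the loader yields chunks of raw domain text (e.g., medical papers).
# Instruction finetuning: the loader yields formatted Q&A prompts with the same shifted targets.
```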
Great write-up!
As for the surprisingly slow adoption of sparse MoEs, I believe this is because it's tricky to train them on GPUs with standard ML compilers because of the inherent expert imbalance. Without additional tricks such as load balancing losses, expert parallelism, capacity factor tuning, etc., it's hard to get the performance gains claimed, e.g., in the Switch Transformer paper.
I do believe the right way to get sparse MoE to work is with its own compute primitives and dedicated compilers. Check out the MegaBlocks paper, which I think has really nailed the implementation.
Sparse MoEs are great in theory but hard to get right in practice on GPUs that really want to run dense operations.
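To make the load-balancing point concrete, here's a rough sketch of the kind of auxiliary loss the Switch Transformer paper proposes (illustrative code, not taken from any particular implementation): it nudges the router toward dispatching tokens uniformly across experts.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts)
    probs = torch.softmax(router_logits, dim=-1)      # router probabilities per token
    expert_idx = probs.argmax(dim=-1)                 # top-1 expert assignment per token
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = torch.bincount(expert_idx, minlength=num_experts).float()
    dispatch_frac = dispatch_frac / router_logits.shape[0]
    # P_i: mean router probability assigned to expert i
    mean_prob = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1 / num_experts)
    return num_experts * torch.sum(dispatch_frac * mean_prob)
```

This auxiliary term gets added to the language modeling loss with a small weight; it's one of the extra moving parts (alongside capacity factors and expert parallelism) that makes MoE training fiddly compared to dense models.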
This is a very, very good point. And besides complicating the training, many open-weight LLMs don't explicitly aim for inference performance as they are not directly deployed as a proprietary service by the developers. But yeah, I am curious to see what the Llama 4 architecture will look like...
Great read, Sebastian. Looking forward to the second half!
Outstanding post Sebastian! Thank you!!
One comment: I think studying the mistakes Gemini makes suggests it's MoE-style. A lot of the time, if you ask it to analyze a video, it will say it's just a language model and can't analyze video, which would most likely be the case if it were using a router to route to specific modalities. That's when I thought: oh, this is probably an MoE.
That's an interesting point. Yes, I think that proprietary LLMs (Gemini, GPT-4, etc.) might be MoEs. But yeah, I was focusing only on the ones where we know for sure, the open-weight ones.
Glad to see you're feeling better
Thanks, Devansh! Slowly but steadily :)
Great summary! And thank you for the resources, they are certainly helpful!
Similar to PPO vs DPO, aren't MoE models generally less popular because they're harder to post-train?
Yes, it adds a certain amount of complexity, and I think many tools haven't caught up with it (yet).
What a great year for AI. Thanks for the read!
Great summary! Wishing you a speedy recovery and a wonderful start to the New Year!
Thanks, reading your blog posts is always a great experience and a great way to learn.
On a humorous note, I wonder if I could create a multi-agent system to write a newsletter like this. Please write a blog post on it 😀