23 Comments

Thanks Seb. I have always wanted to read summaries like these, as one can't keep up with everything, especially when one isn't actively working in this field. Your summaries help to get a bird's-eye view of all the major developments and stay relevant, and for free. Many thanks 🙏

Happy New Year! We hope you recover quickly and stay in great health! I'm Mengge Zheng, an editor from Turing Publishing, part of the People's Posts and Telecommunications Press. The Chinese edition of your book "Build a Large Language Model (From Scratch)" will be hitting the shelves soon! Additionally, we've translated this article into Chinese to share with readers in China. It's truly fantastic content!

Happy New Year! If I remember correctly, the publisher mentioned that there were translations in the works, but I didn't know it would happen so quickly. That's great news!

Loved it Sir. Your articles are gems. They deliver signals in an ocean of noise. You skillfully combine theory and practice.

Thanks for the kind words!

Thank you, Sebastian. Build a Large Language Model (From Scratch) helped me a lot, especially in understanding all the research papers I was reading; it helped me logically understand the concepts.

I think that coding these things from scratch really is an effective way to learn, and I am glad to hear that my book was helpful!

Thanks Sebastian for writing another great summary of the literature! In section 3 (continual pretraining), you mentioned that continual pretraining is a good way to extend an existing LLM to new knowledge; and in section 5 (LoRA), you also mentioned that full finetuning is good at adapting an existing LLM to a new domain. It sounds like both continued pretraining and full finetuning are capable of augmenting an LLM's knowledge base.

As someone who has done neither continual pretraining nor full finetuning due to resource constraints, I'm wondering: if, for instance, I had a pretrained LLM and wanted to make it capable of chatting about medical-related topics (or any distinct domain), and I had all the GPUs, would it be better to perform continual pretraining or full finetuning? Or should I do both?

Good question. The training algorithm in continued pretraining and full finetuning is the same (in my Build an LLM from Scratch book, I designed the code so that you can reuse the training function from the pretraining chapter in the later instruction finetuning chapter to show this). And to your question, I would do both: the pretraining to take up the knowledge, and then the finetuning on Q&A data to preserve the "chatting" capabilities. (Another way to do it would be to use another model to reformat the pretraining data as Q&A data and only do full finetuning; if you don't have massive amounts of data, where that would be infeasible, this is also a good option.)
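
To make that concrete, here is a minimal sketch (with hypothetical loader names, not the book's actual code) of how the same next-token training loop can serve both continued pretraining and full finetuning; only the data fed into it changes:

```python
import torch
import torch.nn.functional as F

def train_epoch(model, data_loader, optimizer, device):
    # Standard next-token prediction loop; works unchanged for
    # continued pretraining and for full finetuning.
    model.train()
    for input_ids, target_ids in data_loader:
        input_ids, target_ids = input_ids.to(device), target_ids.to(device)
        logits = model(input_ids)                        # (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits.flatten(0, 1), target_ids.flatten()   # cross-entropy over all tokens
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Continued pretraining: feed raw domain text (e.g., medical articles)
# train_epoch(model, domain_text_loader, optimizer, device)
# Full finetuning: the same loop, fed instruction/Q&A pairs instead
# train_epoch(model, qa_instruction_loader, optimizer, device)
```

The point being: the optimization itself doesn't change between the two stages; what changes is the data (raw domain text vs. instruction/Q&A pairs).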

Great write-up!

As for the surprisingly slow adoption of sparse MoEs, I believe this is because it's tricky to train them on GPUs with standard ML compilers due to the inherent expert imbalance. Without additional tricks such as load-balancing losses, expert parallelism, capacity factor tuning, etc., it's hard to get the performance gains claimed, e.g., in the Switch Transformer paper.
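
For readers who haven't seen one, here is a minimal sketch of a Switch-Transformer-style load-balancing auxiliary loss for a top-1 router (the function name, shapes, and the alpha value are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, num_experts, alpha=0.01):
    # router_logits: (num_tokens, num_experts) raw scores from the router
    probs = F.softmax(router_logits, dim=-1)
    top1 = probs.argmax(dim=-1)                                    # expert chosen per token
    # f_i: fraction of tokens dispatched to each expert
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]
    # P_i: mean router probability assigned to each expert
    P = probs.mean(dim=0)
    # Minimized when both distributions are uniform, i.e., experts are balanced
    return alpha * num_experts * torch.sum(f * P)
```

This term is added to the language-modeling loss to push the router toward an even token distribution, which is exactly the imbalance problem described above.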

I do believe the right way to get sparse MoE to work is with its own compute primitives and dedicated compilers. Check out the MegaBlocks paper, which I think has really nailed the implementation.

Sparse MoEs are great in theory but hard to get right in practice on GPUs that really want to run dense operations.

This is a very, very good point. And besides complicating the training, many open-weight LLMs don't explicitly aim for inference performance as they are not directly deployed as a proprietary service by the developers. But yeah, I am curious to see what the Llama 4 architecture will look like...

Great read, Sebastian. Looking forward to the second half!

Outstanding post Sebastian! Thank you!!

One comment: I think studying the mistakes of Gemini makes it very clear to me that it's MoE-style. A lot of the time, if you ask it to analyze a video, it will say it's just a language model and can't analyze video, which to me would most likely be the case if it were using a router to route to specific modalities. That's when I thought: oh, this is probably an MoE.

That's an interesting point. Yes, I think that proprietary LLMs (Gemini, GPT-4, etc.) might be MoEs. But yeah, I was focusing only on the ones where we know for sure, i.e., the open-weight ones.

Glad to see you're feeling better

Thanks, Devansh! Slowly but steadily :)

Great summary! And thank you for the resources, they are certainly helpful!

Similar to PPO vs DPO, aren't MoE models generally less popular because they're harder to post-train?

Yes, it adds a certain amount of complexity, and I think many tools haven't caught up with it (yet).

What a great year for AI. Thanks for the read!

Great summary! Wishing you a speedy recovery and a wonderful start to the New Year!

Thanks, reading your blog posts is always a great experience and a learning opportunity.

On a humorous note, I wonder if I could create a multi-agent system to produce a newsletter like this. Please write a blog post on it 😀
