Thanks Seb. I have always wanted to read summaries like these, as one can't keep up with everything, especially when one isn't actively working in this field. Your summaries help me get a bird's-eye view of all the major developments and stay relevant, and for free. Many thanks 🙏
Happy New Year! We hope you recover quickly and stay in great health! I'm Mengge Zheng, an editor from Turing Publishing, part of the People's Posts and Telecommunications Press. The Chinese edition of your book "Build a Large Language Model (From Scratch)" will be hitting the shelves soon! Additionally, we’ve translated this article into Chinese to share with readers in China. It's truly fantastic content!
Happy New Year! If I remember correctly, the publisher mentioned that there were translations in the works, but I didn't know it would happen so quickly. That's great news!
Loved it Sir. Your articles are gems. They deliver signals in an ocean of noise. You skillfully combine theory and practice.
Thanks for the kind words!
Thank you, Sebastian. Build a Large Language Model (From Scratch) helped me a lot, especially in understanding all the research papers I was reading and in making logical sense of the concepts.
I think that coding these things from scratch really is an effective way to learn, and I am glad to hear that my book was helpful!
Thanks Sebastian for writing another great summary of the literature! In section 3 (continual pretraining), you mentioned that continual pretraining is a good way to extend an existing LLM with new knowledge; and in section 5 (LoRA), you also mentioned that full finetuning is good at adapting an existing LLM to a new domain. It sounds like both continual pretraining and full finetuning are capable of augmenting an LLM’s knowledge base.
As someone who has done neither continual pretraining nor full finetuning due to resource constraints, I’m wondering: if, for instance, I had a pretrained LLM and wanted to make it capable of chatting about medical topics (or any distinct domain), and I had all the GPUs, would it be better to perform continual pretraining or full finetuning? Or should I do both?
Good question. The training algorithm in continued pretraining and full finetuning is the same (in my Build a Large Language Model (From Scratch) book, I designed the code so that you can reuse the training function from the pretraining chapter in the later instruction finetuning chapter to show exactly this). And to your question, I would do both: continued pretraining to take up the new knowledge, and then finetuning on Q&A data to preserve the "chatting" capabilities. (Another option would be to use another model to reformat the pretraining data as Q&A data and only do full finetuning; as long as you don't have so much data that reformatting it becomes infeasible, this is also a good option.)
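To give a rough idea of what I mean by "the same training algorithm," here is a minimal sketch (illustrative names, not the actual code from the book): the next-token cross-entropy loop stays identical, and only the data you feed into it changes.

```python
import torch
import torch.nn.functional as F

def train(model, loader, optimizer, device, num_epochs):
    # Same loop for continued pretraining and instruction finetuning;
    # only the contents of `loader` differ.
    model.train()
    for epoch in range(num_epochs):
        for input_ids, target_ids in loader:  # targets = inputs shifted by one token
            input_ids, target_ids = input_ids.to(device), target_ids.to(device)
            optimizer.zero_grad()
            logits = model(input_ids)  # (batch, seq_len, vocab_size)
            loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
            loss.backward()
            optimizer.step()

# Continued pretraining: the loader yields chunks of raw domain text (e.g., medical papers).
# Instruction finetuning: the loader yields formatted Q&A prompts with the same shifted targets.
```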
Great write-up!
As for the surprisingly slow adoption of sparse MoEs, I believe this is because it's tricky to train them on GPUs with standard ML compilers because of the inherent expert imbalance. Without additional tricks such as load balancing losses, expert parallelism, capacity factor tuning, etc., it's hard to get the performance gains claimed, e.g., in the Switch Transformer paper.
I do believe the right way to get sparse MoE to work is with its own compute primitives and dedicated compilers. Check out the MegaBlocks paper, which I think has really nailed the implementation.
Sparse MoEs are great in theory but hard to get right in practice on GPUs that really want to run dense operations.
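To make the load-balancing point concrete, here's a rough sketch of the kind of auxiliary loss the Switch Transformer paper proposes (illustrative code, not taken from any particular implementation): it nudges the router toward dispatching tokens uniformly across experts.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts)
    probs = torch.softmax(router_logits, dim=-1)      # router probabilities per token
    expert_idx = probs.argmax(dim=-1)                 # top-1 expert assignment per token
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = torch.bincount(expert_idx, minlength=num_experts).float()
    dispatch_frac = dispatch_frac / router_logits.shape[0]
    # P_i: mean router probability assigned to expert i
    mean_prob = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1 / num_experts)
    return num_experts * torch.sum(dispatch_frac * mean_prob)
```

This auxiliary term gets added to the language modeling loss with a small weight; it's one of the extra moving parts (alongside capacity factors and expert parallelism) that makes MoE training fiddly compared to dense models.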
This is a very, very good point. And besides complicating the training, many open-weight LLMs don't explicitly aim for inference performance as they are not directly deployed as a proprietary service by the developers. But yeah, I am curious to see what the Llama 4 architecture will look like...
Great read, Sebastian. Looking forward to the second half!
Outstanding post Sebastian! Thank you!!
One comment: I think studying the mistakes Gemini makes suggests it's MoE-style. A lot of the time, if you ask it to analyze a video, it will say it's just a language model and can't analyze video, which would most likely be the case if it were using a router to route to specific modalities. That's when I thought: oh, this is probably an MoE.
That's an interesting point. Yes, I think that proprietary LLMs (Gemini, GPT-4, etc.) might be MoEs. But yeah, I was focusing only on the ones where we know for sure, the open-weight ones.
Glad to see you're feeling better
Thanks, Devansh! Slowly but steadily :)
Great summary! And thank you for the resources, they are certainly helpful!
Similar to PPO vs DPO, aren't MoE models generally less popular because they're harder to post-train?
Yes, it adds a certain amount of complexity, and I think many tools haven't caught up with it (yet).
What a great year for AI. Thanks for the read!
Great summary! Wishing you a speedy recovery and a wonderful start to the New Year!
Thanks, reading your blog posts is always a great experience and a great way to learn.
On a humorous note, I wonder if I could create a multi-agent system to write a newsletter like this. Please write a blog post on it 😀