Great article! Thank you for sharing it!
A quick question: is the article available as a PDF, by any chance? It would make highlighting and writing notes easier for me. Thank you!
Thanks! Unfortunately, I don't have a PDF version. However, I just checked, and the browser's PDF export function seems to produce reasonable results, if that helps.
Thank you! Yes, indeed, the PDF export from Chrome with some custom settings worked pretty well!
(I was initially trying Safari, but some images were not exported to the PDF)
Great article. I often get the crux just from your visualizations and then read on to confirm my understanding. Kudos on the visualizations; I really love them.
One question: I wonder whether these two methods are also applicable to other non-text modalities (video or audio)?
Glad to hear! And yes, they are applicable to audio and video as well. In that case, you would replace the image encoder with an audio encoder, for example.
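In case a concrete picture helps other readers: below is a minimal, hypothetical sketch of that swap for Method A (the unified embedding decoder approach). The encoder dimensions and module names are made up for illustration and are not taken from the article's code.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Projects encoder outputs into the LLM's token-embedding space (Method A)."""
    def __init__(self, encoder_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_embed_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches_or_frames, encoder_dim)
        return self.proj(features)

# Hypothetical dimensions, for illustration only
llm_embed_dim = 4096
image_encoder_dim = 1024   # e.g., a ViT-style image encoder
audio_encoder_dim = 768    # e.g., a spectrogram-based audio encoder

image_projector = ModalityProjector(image_encoder_dim, llm_embed_dim)
audio_projector = ModalityProjector(audio_encoder_dim, llm_embed_dim)

image_features = torch.randn(1, 196, image_encoder_dim)  # 196 image patches
audio_features = torch.randn(1, 300, audio_encoder_dim)  # 300 audio frames
text_embeddings = torch.randn(1, 32, llm_embed_dim)      # 32 text tokens

# Method A: the LLM only ever sees embedding vectors, so swapping the modality
# just means swapping the encoder (and its projector) in front of it.
llm_input_image = torch.cat([image_projector(image_features), text_embeddings], dim=1)
llm_input_audio = torch.cat([audio_projector(audio_features), text_embeddings], dim=1)
print(llm_input_image.shape, llm_input_audio.shape)  # (1, 228, 4096), (1, 332, 4096)
```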
Thanks Sebastian for this.
Another excellent article, Sebastian!! Thanks for writing. And I love the care and attention you bring to everything (e.g., the regular attention and cross-attention pictures). And I love how you interleave text and code explanations - your writing is itself multimodal :-).
A meta question: Given the time investment needed to produce this high-quality work, how do you decide what to work on next?
Thanks for the kind words! It's a big time investment actually, so to answer your question, I only do this for topics I really care about or am currently excited about :)
Amazing! Thank you for such in-depth articles.
The only thing that remains unclear to me is the difference between the stages of the multimodal training process. That is, what is the difference between the pretraining and SFT stages when training multimodal models? Perhaps I did not ask the right question; I just got confused, especially considering that different authors train and freeze different parts of the models at different stages.
It is also not entirely clear what the data looks like at the different stages of training.
Thanks, and these are good questions. The data options consist of text-only data (for LLM pretraining), image-only or image-text pairs (for the image encoder pretraining), image-text pairs (for training the multimodal system), and visual Q&A data (for finetuning). But yeah, it's a whole topic in itself, and the article is already so long; it would be a good topic for a "Training Multimodal LLMs" article in the future :)
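For readers who got stuck on the same point, one hypothetical way to lay out the stages is sketched below. The exact data mix and which components are frozen genuinely differ between papers, so treat this as an illustration rather than a canonical recipe.

```python
# Illustrative only: different papers train/freeze different components per stage.
training_stages = [
    {
        "stage": "LLM pretraining",
        "data": "text-only corpus",
        "trainable": ["LLM"],
        "frozen": [],
    },
    {
        "stage": "Image encoder pretraining",
        "data": "image-only or image-text pairs",
        "trainable": ["image encoder"],
        "frozen": [],
    },
    {
        "stage": "Multimodal pretraining (alignment)",
        "data": "image-text pairs",
        "trainable": ["projector or cross-attention layers"],  # often only the new parts
        "frozen": ["LLM (sometimes)", "image encoder (sometimes)"],
    },
    {
        "stage": "Instruction finetuning (SFT)",
        "data": "visual Q&A / instruction-style conversations",
        "trainable": ["projector", "LLM (sometimes)", "image encoder (sometimes)"],
        "frozen": ["whatever the respective authors chose to keep fixed"],
    },
]

for stage in training_stages:
    print(f'{stage["stage"]:35s} | data: {stage["data"]}')
```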
I might be wrong, but this confuses me:
"(The image encoder operates on images with a resolution of 224×224, dividing them into 16×16 patches of uniform size.)"
-> do you mean 14×14 patches, each of size 16×16 >pixels<?
"As the cross-attention layers add a substantial amount of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 parameters.)"
-> 20B
and here the URL is missing:
"Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article, URL:), they follow up with instruction and preference finetuning."
Thanks for those, will update! Btw, for the initial question: yes, the 16×16 refers to the patch size.
Ah, yeah, I got it - thanks a lot!
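For anyone else who stumbled over the same sentence: with a 224×224 input and 16×16-pixel patches, the encoder produces a 14×14 grid of patches, i.e., 196 patch tokens. A quick sanity check of the arithmetic (the variable names are just for illustration):

```python
image_resolution = 224   # 224x224 input image
patch_size = 16          # each patch covers 16x16 pixels

patches_per_side = image_resolution // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2                # 14 * 14 = 196

print(patches_per_side, num_patches)  # 14 196
```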
Great overview! How do you find the time to draw all those detailed visualizations? ;)
A question about section 2.1.3: So for training the model, the input text must be a description of the image, right?
haha yeah, it took me a while. Almost a month I think :P.
Yes, the training data consists of paired image-text data.
Alright, thanks! Great work, as usual!
As the article is quite long, may I suggest adding a TOC at the beginning of your articles? It would make navigating the text more convenient for readers :)
Yeah, I wish Substack would add a ToC feature one day!
Actually, I stand corrected. There is a TOC now! You have to click on the "lines" on the left to make it pop up. I can't post a screenshot here directly, but see the link here: https://drive.google.com/file/d/1Y9G4G5DhOhTAb5G8Jxh5GJQBzlHRsXoZ/view?usp=sharing
Oh, wow, you are right! Nice feature - but REALLY hard to see, very sub-optimal usability imho...
Agreed, the discoverability of that feature is not great
Great article! The comparison between the two training approaches is interesting. Definitely learned something new!
Thanks for the kind words and I am glad to hear this!
Very nice, I truly enjoyed reading it
Thanks!!
On Emu3: "Next-Token Prediction is All You Need" presents a significant contribution to the field of image generation. LlamaGen is a compelling development, showing that vanilla autoregressive models, like Llama, can achieve strong results in image generation without specialized visual biases, given the right scale. The comprehensive approach, from optimized tokenizers to scalable models, demonstrates that autoregressive methods can compete with diffusion models like LDM and DiT on ImageNet benchmarks. Plus, the impressive speedup via LLM serving frameworks adds real-world feasibility for faster inference. LlamaGen's open-source release is a huge win for the community, providing robust resources for advancing visual generation and multimodal models.
Key strengths of the paper:
Image tokenizer: The authors develop an image tokenizer that achieves impressive reconstruction quality and codebook usage, showing that discrete representation in image tokenizers is no longer a bottleneck in image reconstruction.
Scalable image generation model: They develop a series of class-conditional image generation models based on the Llama architecture, with the largest model outperforming popular diffusion models on the ImageNet benchmark. This demonstrates that vanilla autoregressive models without inductive biases on visual signals can serve as the basis for effective image generation systems.
High-quality training data: The authors train a text-conditional image generation model on a large dataset and fine-tune it on a smaller dataset of high aesthetic quality images, demonstrating competitive performance in terms of visual quality and text alignment.
Optimized inference speed: By adopting the vLLM framework, the authors achieve significant speedups in the inference speed of their image generation models.
Open-source contribution: The authors release all models and code, making a valuable contribution to the open-source community in visual generation and multimodal foundation models.
A potential area for improvement:
Comparison with state-of-the-art: While the authors acknowledge that their released models are still behind state-of-the-art visual generation models based on diffusion models, it would be interesting to see a more detailed comparison and discussion of the potential for autoregressive models to close this gap with more training data and computational resources.
Overall, this paper makes a significant contribution by demonstrating the potential of autoregressive models in image generation and providing valuable resources to the open-source community. The authors' approach of reducing inductive biases on visual signals and adopting the same architecture as language models is a promising direction for developing unified models between language and vision. The impressive results achieved with their image tokenizer, scalable models, and optimized inference speed highlight the potential of this approach. Well done!
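For readers who have not looked at these papers yet, the core mechanism is easy to sketch: the image tokenizer maps an image to a sequence of discrete codebook indices, and a plain decoder-only transformer is then trained to predict the next image token, exactly like next-word prediction. The toy model below only illustrates that objective; the sizes are made up and it is not LlamaGen's or Emu3's actual architecture.

```python
import torch
import torch.nn as nn

# Toy sizes, for illustration only
codebook_size = 1024   # discrete image-token vocabulary from the tokenizer
seq_len = 256          # e.g., a 16x16 grid of image tokens
embed_dim = 512

class TinyARImageModel(nn.Module):
    """Stand-in for a Llama-style decoder that predicts the next image token."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, embed_dim)
        self.pos_emb = nn.Embedding(seq_len, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, codebook_size)

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.blocks(x, mask=causal_mask)  # causal self-attention over image tokens
        return self.head(x)                   # (b, t, codebook_size)

# "Image as a sentence": predict token t+1 from tokens 0..t
image_tokens = torch.randint(0, codebook_size, (2, seq_len))  # from the image tokenizer
model = TinyARImageModel()
logits = model(image_tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, codebook_size), image_tokens[:, 1:].reshape(-1)
)
print(loss.item())
```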
This is an impressive and comprehensive overview of the latest developments in multimodal large language models (LLMs). The author has done an excellent job of explaining the key concepts and methods in an accessible way, while also providing a detailed look at several recent research papers on the topic.
Some key strengths of the article:
Clear explanations of the two main approaches to building multimodal LLMs: the Unified Embedding Decoder Architecture (Method A) and the Cross-Modality Attention Architecture (Method B). The author breaks down these methods with helpful diagrams and code examples. (A minimal sketch contrasting the two follows after this list.)
A thorough review of 10 recent research papers, covering a diverse range of models and approaches. This gives the reader a good sense of the current state-of-the-art in multimodal LLMs.
Useful comparisons and analysis, such as the discussion of the trade-offs between Methods A and B, and the summary table at the end comparing the different models.
Coverage of some novel ideas and approaches, such as the Naive Dynamic Resolution mechanism in Qwen2-VL for handling varying image resolutions, and the Emu3 model for image generation using a transformer decoder.
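To make the Method A vs. Method B distinction mentioned above concrete, here is a minimal, hypothetical contrast (toy dimensions; this is not code from the article): Method A prepends projected image tokens to the text-token sequence, while Method B keeps the text sequence unchanged and adds cross-attention layers in which text positions attend to the image features.

```python
import torch
import torch.nn as nn

d = 64                          # toy shared embedding dimension
text = torch.randn(1, 10, d)    # 10 text-token embeddings
img = torch.randn(1, 196, d)    # 196 projected image-patch embeddings

# Method A (unified embedding decoder): concatenate image and text tokens and let
# the unmodified decoder self-attend over the combined sequence.
method_a_input = torch.cat([img, text], dim=1)  # (1, 206, d)

# Method B (cross-modality attention): text positions act as queries and attend to
# the image features (keys/values) via added cross-attention layers.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
method_b_output, _ = cross_attn(query=text, key=img, value=img)  # (1, 10, d)

print(method_a_input.shape, method_b_output.shape)
```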
A few potential areas for improvement:
The article is quite long and dense with information, which might be overwhelming for some readers. Breaking it up into more sections with clear headings could help with readability.
While the author acknowledges the difficulty of comparing model performance, it would still be interesting to see some discussion of how these models perform on standard benchmarks or real-world tasks.
The article focuses primarily on image-text models, but it would be nice to see more discussion of other modalities like video and speech (which are briefly mentioned in the Llama 3 section).
Overall, this is a valuable resource for anyone looking to understand the latest advances in multimodal LLMs. The author has done an excellent job of synthesizing a large amount of complex information into a readable and informative article. The inclusion of the author's own book at the end feels a bit self-promotional, but it's understandable given the amount of work that clearly went into this piece. Well done!