Thank you for sharing it!
However, in the last "overview" diagram, the "Method" entries for Molmo and NVLM seem to be filled in incorrectly. That is, "Both + Hybrid" should correspond to NVLM instead of Molmo.
You are right, looks like this was a row swap. Should be corrected now. Thanks for letting me know!
Great overview! How do you find the time to draw all those detailed visualizations? ;)
A question about section 2.1.3: so for training the model, the input text must be a description of the image, right?
Haha, yeah, it took me a while. Almost a month, I think :P
Yes, the training data consists of paired image-text data.
Alright, thanks! Great work, as usual!
As the article is quite long, may I suggest adding a TOC to the beginning of your articles? It would make navigating the text more convenient for readers :)
Yeah, I wish Substack would add a ToC feature one day!
Actually, I stand corrected. There is a TOC now! You have to click on the "lines" on the left to make it pop up. I can't post a screenshot here directly, but see the link here: https://drive.google.com/file/d/1Y9G4G5DhOhTAb5G8Jxh5GJQBzlHRsXoZ/view?usp=sharing
Oh, wow, you are right! Nice feature - but REALLY hard to see, very sub-optimal usability imho...
Agreed, the discoverability of that feature is not great
Aria has 25.3B total parameters
Thanks for the note. Hm, weird. Must be a typo in their paper: https://arxiv.org/pdf/2410.05993v4
At the bottom of the first page, they write:
> and has a total of 24.9B parameters
I just checked with the authors, and they said it was a typo in the paper.
Good article! Thanks!
How well curated this is! Great work.
I've been trying to learn more about multimodal LLMs too and I found your blog post very helpful! Thanks! 🙏
I was recently reading about the Segment Anything Model (SAM) and the Florence-2 model. I suppose in your terminology, SAM uses a strategy like Method B "cross-attention", while Florence-2 uses a strategy like your Method A "unified embedding". However, these categories don't match up perfectly, I guess, because Florence-2 uses an encoder/decoder LLM and adds new "location tokens" to the vocabulary of its pre-trained LLM's tokenizer (which is a little more than further training an LLM or a "projector" MLP 🤷‍♂️; see the rough sketch at the end of this comment), and SAM generates masks instead of text output 😅
I was personally interested in SAM because I wanted to understand how SAM and other multimodal LLMs handle ambiguity and different scales of objects in an image. For example, if an image has multiple people in it and I ask my MM-LLM a question about a person, how does it know whether I'm referring to a large person in the foreground or a small person in the background? 🤔 😅
Anyway, thanks again for your excellent posts! I've been benefiting from your explanations for a long time 🙂 I wanted to support you more, so I just bought a copy of your book "Build a Large Language Model (From Scratch)" 🎉 I'm excited to read it!
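For what it's worth, here is the rough sketch I mentioned above of what I imagine the "adding location tokens" step looks like in code. It's just an illustration using Hugging Face's generic `add_tokens` / `resize_token_embeddings` API with `t5-small` as a stand-in encoder/decoder model; I haven't checked Florence-2's actual training code, and the token format is made up:

```python
# Rough sketch (not Florence-2's actual code): extending a pretrained
# tokenizer with extra "location tokens" and resizing the embedding matrix.
# "t5-small" is just a stand-in encoder/decoder model for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# e.g., 1000 quantized coordinate bins represented as special tokens
location_tokens = [f"<loc_{i}>" for i in range(1000)]
num_added = tokenizer.add_tokens(location_tokens)

# The input (and tied output) embedding layers must grow to cover the new
# vocabulary entries; the new rows start randomly initialized and are learned
# during training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```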
Thanks for the detailed comment (and the kind words)! I must say that I haven't had a chance to look at SAM recently. I read about it roughly 1.5 years ago and don't remember the details (as I am not really a user of segmentation models). But what you describe is probably right!
Thank you for your informative content. I have read your book. Can I translate your book into Persian? What permissions do I need to obtain, and how can I arrange a financial agreement with you for this (assuming your answer is yes, of course)?
Thanks for your interest in translating the book! The translation rights are handled directly by the publisher. You can inquire about it via email at rights@manning.com. If you don't hear back from them, please let me know and I can also try to reach them.
Thanks for the great article. I really appreciate the work that went into putting this together. While reading, I noticed that the hyperlink in this article to "Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs" is broken. Hopefully, you're able to fix it.
Thanks for the note. Somehow I missed this earlier but just took care of it and fixed it!
Okay 👌 machine learning then
Thanks Seb -- looks like a mistake to me, from what I understand of the Molmo picture?
"Illustration of the Molmo decoder-only approach (Method B)."
But Method B is the cross-attention method, and it seems from the image that Molmo is using Method A, where the image/text tokens together make up the context that gets fed into the model.
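To spell out how I'm reading the difference between the two methods, here's a toy sketch (made-up tensor shapes, not Molmo's actual code):

```python
import torch

batch, n_img, n_txt, d_model = 1, 16, 8, 512  # made-up toy dimensions
img_tokens = torch.randn(batch, n_img, d_model)  # image patches after the projector
txt_tokens = torch.randn(batch, n_txt, d_model)  # embedded text tokens

# Method A (unified embedding): image and text tokens are concatenated into one
# sequence, and the decoder simply self-attends over the combined context.
context = torch.cat([img_tokens, txt_tokens], dim=1)  # shape: (1, 24, 512)

# Method B (cross-attention): the decoder's sequence contains only text tokens,
# and the image tokens are attended to via extra cross-attention layers
# (queries from the text, keys/values from the image).
cross_attn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, _ = cross_attn(query=txt_tokens, key=img_tokens, value=img_tokens)
print(context.shape, out.shape)  # torch.Size([1, 24, 512]) torch.Size([1, 8, 512])
```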
Thanks for the note; you are right, it should have said "A", not "B". Fixed it!
One perspective shows much, but multimodality reveals the depth beneath.
I don't understand what you mean by Llama 3.2 training its image encoder from scratch. My understanding is that they use a pre-trained alternative to CLIP (MetaCLIP, from "Demystifying CLIP Data"). The distinction between "trained from scratch" and "further training" is a bit confusing in the table, as most of these models are pretrained in a way that's unrelated to the LLM.
Thanks for the comment, Adam. To me, it sounded like they used a CLIP-like architecture but trained it themselves. This is based on the following quote from the Llama 3 paper: "We train separate encoders for images and speech. We train our image encoder on large amounts of image-text pairs. This teaches the model the relation between visual content and the description of that content in natural language."
Please correct me if I'm wrong though; I may have missed the part where they said they initialized it with pretrained weights.
Great article! Thank you for sharing it!
A quick question: is the article available as a PDF, by any chance? It would make highlighting and writing notes easier for me. Thank you!
Thanks! Unfortunately, I don't have a PDF version. However, I just checked, and the browser's PDF export function seems to produce reasonable results, if that helps.
Thank you! Yes, indeed, the PDF export from Chrome with some custom settings worked pretty well!
(I was initially trying Safari, but some images were not exported to the PDF)
Great article. I often get the crux just from your visualizations and then read on to confirm my understanding. Kudos on your visualizations; I really love them.
One question: I wonder, are these two methods also applicable to other non-text modalities (video or audio) as well?
Glad to hear it! And yes, they are applicable to audio and video as well. In that case, you would replace the image encoder with an audio encoder, for example.
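As a rough illustration of what I mean (a toy sketch with made-up shapes, not code from any particular model): the projector and the LLM side stay the same, and only the encoder that produces the feature sequence changes.

```python
import torch

d_llm = 2048  # hypothetical LLM embedding size

# Pretend encoder outputs (in practice these would come from, e.g., a CLIP-style
# vision encoder or a Whisper-style audio encoder; the shapes here are made up).
image_features = torch.randn(1, 576, 1024)   # (batch, num_patches, encoder_dim)
audio_features = torch.randn(1, 1500, 1024)  # (batch, num_frames, encoder_dim)

# The projector and the LLM don't care which modality the features came from;
# only the sequence length and the encoder output dimension matter.
projector = torch.nn.Linear(1024, d_llm)

print(projector(image_features).shape)  # torch.Size([1, 576, 2048])
print(projector(audio_features).shape)  # torch.Size([1, 1500, 2048])
```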
Thanks Sebastian for this.
Another excellent article, Sebastian! Thanks for writing it. I love the care and attention you bring to everything (e.g., the regular attention and cross-attention pictures), and I love how you interleave text and code explanations - your writing is itself multimodal :-).
A meta question: Given the time investment needed to produce this high-quality work, how do you decide what to work on next?
Thanks for the kind words! It's a big time investment actually, so to answer your question, I only do this for topics I really care about or am currently excited about :)