40 Comments
Nov 11 · Liked by Sebastian Raschka, PhD

Thank you for sharing it!

However, in the last "overview" diagram, the "Method" of Molmo and NVLM seems to be filled in incorrectly. That is, "Both + Hybrid" should correspond to NVLM instead of Molmo.

author

You are right, looks like this was a row swap. Should be corrected now. Thanks for letting me know!

Nov 3 · Liked by Sebastian Raschka, PhD

Great overview! How do you find the time to draw all those detailed visualizations? ;)

A question about 2.1.3: so for training the model, the input text must be a description of the image, right?

author

haha yeah, it took me a while. Almost a month I think :P.

Yes, the training data consists of paired image-text data.
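
In case it helps to picture it, here is a minimal, purely illustrative sketch of such pairs (the file names and captions are made up):

```python
from dataclasses import dataclass

# Each training example couples an image with a natural-language description of it.
@dataclass
class ImageTextPair:
    image_path: str
    caption: str

training_pairs = [
    ImageTextPair("images/0001.jpg", "A golden retriever catching a frisbee in a park."),
    ImageTextPair("images/0002.jpg", "A bowl of ramen topped with a soft-boiled egg."),
]
```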

Nov 3 · Liked by Sebastian Raschka, PhD

Alright, thanks! Great work, as usual!

As the article is quite long, may I suggest adding a TOC at the beginning of your articles? It would make navigating the text more convenient for readers :)

author

Yeah, I wish Substack would add a ToC feature one day!

author

Actually, I stand corrected. There is a TOC now! You have to click on the "lines" on the left to make it pop up. Can't post a screenshot here directly, but see the link here: https://drive.google.com/file/d/1Y9G4G5DhOhTAb5G8Jxh5GJQBzlHRsXoZ/view?usp=sharing

Nov 3 · Liked by Sebastian Raschka, PhD

Oh, wow, you are right! Nice feature - but REALLY hard to see, very sub-optimal usability imho...

author

Agreed, the discoverability of that feature is not great

Nov 18 · Liked by Sebastian Raschka, PhD

Thank you for your informative content. I have read your book. Can I translate your book into Persian? What permissions do I need to obtain and how can I make a financial agreement with you for this? (Of course, if your answer is yes)

author

Thanks for your interest in translating the book! The translation rights are handled directly by the publisher. You can inquire about it via email at rights@manning.com. If you don't hear back from them, please let me know and I can also try to reach them.


Okay 👌 machine learning then

Nov 9 · Liked by Sebastian Raschka, PhD

Thanks Seb -- looks like a mistake to me, from what I understand of the Molmo picture?

"Illustration of the Molmo decoder-only approach (Method B)."

But Method B is the cross-attention method, and it seems from the image that Molmo is using Method A, where the image and text tokens together compose the context that gets fed into the model.
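
For reference, here is a minimal sketch of what Method A means mechanically (the dimensions and module names below are made up for illustration, not Molmo's actual code): the projected image patch embeddings and the text token embeddings are simply concatenated into one input sequence for the decoder.

```python
import torch
import torch.nn as nn

# Toy dimensions, purely for illustration (not Molmo's actual sizes or code)
vision_dim, llm_dim, vocab_size = 1024, 4096, 32000

connector = nn.Linear(vision_dim, llm_dim)        # projects patch embeddings into the LLM embedding space
token_embedding = nn.Embedding(vocab_size, llm_dim)

image_patches = torch.randn(1, 196, vision_dim)   # image-encoder output, e.g., 14x14 = 196 patches
text_ids = torch.randint(0, vocab_size, (1, 12))  # tokenized prompt

image_tokens = connector(image_patches)           # (1, 196, llm_dim)
text_tokens = token_embedding(text_ids)           # (1, 12, llm_dim)

# Method A: image and text tokens form one combined input sequence for the decoder
decoder_input = torch.cat([image_tokens, text_tokens], dim=1)  # (1, 208, llm_dim)
```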

author

Thanks for the note, you are right, it should have said "A" not "B". Fixed it!

Nov 7 · Liked by Sebastian Raschka, PhD

I don't understand what you mean by Llama 3.2 training their image encoder from scratch. My understanding is that they use a pretrained alternative to CLIP (MetaCLIP, from "Demystifying CLIP Data"). The distinction between "Trained from scratch" and "Further training" is a bit confusing in the table, as most of these models are pretrained in a way that's unrelated to the LLM.

author

Thanks for the comment, Adam. To me, it sounded like they used a CLIP-like architecture but trained it themselves. This is based on the following quote from the Llama 3 paper: "We train separate encoders for images and speech. We train our image encoder on large amounts of image-text pairs. This teaches the model the relation between visual content and the description of that content in natural language."

Please correct me if I'm wrong though; I may have missed the part where they said they initialized it with pretrained weights.
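
For what it's worth, "training an image encoder on large amounts of image-text pairs" in this CLIP-like setting usually refers to a contrastive objective along these lines (a generic sketch, not Meta's actual training code):

```python
import torch
import torch.nn.functional as F

# Generic CLIP-style contrastive loss sketch; image_emb and text_emb are batches
# of embeddings produced by the image encoder and the text encoder, respectively.
def contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.shape[0])        # the i-th image belongs to the i-th caption
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# toy usage with random embeddings
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```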

Nov 3 · Liked by Sebastian Raschka, PhD

Great article! Thank you for sharing it!

A quick question: is the article available as a PDF, by any chance? It would make highlighting and writing notes easier for me. Thank you!

author

Thanks! Unfortunately I don't have a PDF version. However, I just checked and the browser's PDF export function seems to produce reasonable results if that helps


Thank you! Yes, indeed, the PDF export from Chrome with some custom settings worked pretty well!

(I was initially trying Safari, but some images were not exported to the PDF)

Nov 3 · Liked by Sebastian Raschka, PhD

Great article. I often get the crux just from your visualizations and then read on to reconfirm my understanding. Kudos to your visualizations. I really love them.

One question: are these two methods also applicable to other non-text modalities (video or audio)?

author

Glad to hear! And yes, they are applicable to audio and video as well. In that case, you would replace the image encoder with an audio encoder, for example.
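
A minimal sketch of that idea, with hypothetical module names (not the article's code): only the modality-specific encoder is swapped out, while the projector/connector and the LLM backbone stay the same.

```python
import torch.nn as nn

# Hypothetical wrapper: the multimodal system is largely modality-agnostic once the
# encoder emits a sequence of embeddings that the connector can project.
class MultimodalLLM(nn.Module):
    def __init__(self, modality_encoder, connector, llm):
        super().__init__()
        self.encoder = modality_encoder  # image encoder, audio encoder, video encoder, ...
        self.connector = connector       # projects encoder outputs into the LLM embedding space
        self.llm = llm                   # unchanged text decoder

    def forward(self, modality_input, text_tokens):
        modality_emb = self.connector(self.encoder(modality_input))
        # the llm combines both inputs, e.g., via concatenation (Method A) or cross-attention (Method B)
        return self.llm(modality_emb, text_tokens)
```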

Nov 3 · Liked by Sebastian Raschka, PhD

Thanks Sebastian for this.

Nov 3 · Liked by Sebastian Raschka, PhD

Another excellent article, Sebastian!! Thanks for writing. And I love the care and attention you bring to everything (e.g., the regular attention and cross-attention pictures). And I love how you interleave text and code explanations - your writing is itself multimodal :-).

A meta question: Given the time investment needed to produce this high-quality work, how do you decide what to work on next?

author

Thanks for the kind words! It's a big time investment actually, so to answer your question, I only do this for topics I really care about or am currently excited about :)

Nov 3 · edited Nov 3 · Liked by Sebastian Raschka, PhD

Amazing! Thank you for such in-depth articles.

The only thing that remains unclear to me is the difference between the stages of the multimodal training process. That is, what is the difference between the pretraining and SFT stages when training multimodal models? Perhaps I did not ask the question correctly; I just got confused, also considering that different authors train and freeze different parts of their models at different stages.

It is also not entirely clear what the data looks like at the different stages of training.

author

Thanks, and these are good questions. The data options consist of text-only data (for LLM pretraining), image-only or image-text pairs (for the image encoder pretraining), image-text pairs for training the multimodal system, and visual Q&A data for finetuning. But yeah, it's a whole topic in itself, and the article is already quite long; it would be a good topic for a "Training Multimodal LLMs" article in the future :)
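
To make the staging a bit more concrete, here is a rough sketch of how the freezing/unfreezing is often handled in code (the stage names and choices below are illustrative; as noted, papers differ in what they freeze when):

```python
# Illustrative staging only; assumes a model with .encoder, .connector, and .llm
# submodules, as in the sketch a few comments above.
def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage):
    if stage == "connector_pretraining":
        # Stage 1: align modalities on image-text pairs; only the connector learns
        set_trainable(model.encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.connector, True)
    elif stage == "instruction_finetuning":
        # Stage 2: visual Q&A / instruction data; typically the LLM (and sometimes
        # the encoder) is unfrozen as well
        set_trainable(model.llm, True)
        set_trainable(model.connector, True)
```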

Nov 3 · Liked by Sebastian Raschka, PhD

I might be wrong, but this confuses me:

"(The image encoder operates on images with a resolution of 224×224, dividing them into 16×16 patches of uniform size.)"

-> do you mean 14×14 patches, each of size 16×16 >pixels<?

Nov 3 · Liked by Sebastian Raschka, PhD

"As the cross-attention layers add a substantial amount of parameters, they are only added in every fourth transformer block. (For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20 parameters.)"

-> 20B
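
As a side note for other readers, here is a toy sketch of the "every fourth block" layout (tiny dimensions and stand-in PyTorch modules, not Llama 3.2's implementation):

```python
import torch.nn as nn

# Toy dimensions so the sketch stays lightweight; the real models are far larger.
emb_dim, n_heads, num_layers = 64, 4, 8
cross_attn_every = 4  # add a cross-attention block only after every fourth layer

blocks = nn.ModuleList()
for i in range(num_layers):
    # stand-in for a regular self-attention + feed-forward decoder block
    blocks.append(nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads, batch_first=True))
    if (i + 1) % cross_attn_every == 0:
        # extra cross-attention over the image-encoder outputs; at full scale, these
        # added layers account for the extra ~3B (8B model) / ~20B (70B model) parameters
        blocks.append(nn.MultiheadAttention(embed_dim=emb_dim, num_heads=n_heads, batch_first=True))
```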

Nov 3 · Liked by Sebastian Raschka, PhD

and here the URL is missing:

"Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article, URL:), they follow up with instruction and preference finetuning."

author

Thanks for those, will update! Btw, regarding the initial question: yes, the 16×16 refers to the patch size in pixels.
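
To spell out the arithmetic for anyone skimming: with 224×224 inputs and 16×16-pixel patches, you get 224/16 = 14 patches per side, i.e., 14×14 = 196 patches in total. A quick sketch of the ViT-style patchify step (generic, not any specific model's code):

```python
import torch

image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2          # 14 x 14 = 196 patches, each 16x16 pixels

# ViT-style patchify: (batch, channels, 224, 224) -> (batch, 196, 16*16*channels)
img = torch.randn(1, 3, image_size, image_size)
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)
print(patches.shape)  # torch.Size([1, 196, 768])
```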

Nov 3 · Liked by Sebastian Raschka, PhD

Ah, yeah, I got it - thanks a lot!

Nov 3 · Liked by Sebastian Raschka, PhD

Great article! The comparison between the two training approaches is interesting. Definitely learned something new!

author

Thanks for the kind words and I am glad to hear this!

Nov 3 · Liked by Sebastian Raschka, PhD

Very nice, I truly enjoyed reading it

author

Thanks!!


Thanks for the great article. I really appreciate the work behind putting this together. While reading, I noticed that the hyperlink to "Understanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs" referenced in this article is broken. Hopefully, you're able to fix it.


One perspective shows much, but multimodality reveals the depth beneath.
