24 Comments
Aug 17 · Liked by Sebastian Raschka, PhD

Amazing summary, thank you!

Just a quick question regarding the Qwen 2 training.

I read in the report:

"Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities."

=> Doesn't this mean that there is some QA-format data in pre-training (more than a simple quality stage)?

author

Thanks for the kind note! You are absolutely right, there was this sentence at the end of the "Distribution Improvement" paragraph that I totally overlooked. I updated the article! Thanks!

Aug 26 · Liked by Sebastian Raschka, PhD

This deep dive into LLM pre-training and post-training paradigms is fascinating. It's amazing to see how much the field has evolved with different models like Qwen 2, Apple's AFM, and Llama. Definitely learned a lot—thanks for sharing this! 🙏

Sep 23 · Liked by Sebastian Raschka, PhD

Amazing post! Thank you so much. It would also be great to have it updated with the Qwen 2.5 information.

author

Thanks! If I understand correctly, though, there is no Qwen 2.5 paper (yet), only papers for their specialized models:

- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, https://arxiv.org/abs/2409.12122

- Qwen2.5-Coder Technical Report, https://arxiv.org/abs/2409.12186

But yeah, maybe interesting for a follow-up article some time :)

Sep 6 · Liked by Sebastian Raschka, PhD

Thanks for an excellent overview! Can you explain "as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so the LLM can fit more tokens into the same input"? This isn't intuitive to me.

author

Good question. That's because a larger tokenizer vocabulary contains more unique words (and longer word pieces), so the same text can be covered with fewer tokens. E.g., as a simple illustration, a small vocabulary may be:

- note

- book

- air

- port

and with a bigger vocabulary you might have

- note

- book

- notebook

- air

- port

- airport

etc.

I.e., if you have the sentence "I used my notebook at the airport", a tokenizer with the small vocabulary would produce more tokens (here, 2 tokens each for "notebook" and "airport" instead of 1 token each).

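The note/book/air/port illustration above can be sketched in a few lines of Python. This is a toy example only: it assumes a greedy longest-match tokenizer over whitespace-split words, which is a simplification and not how BPE tokenizers in real LLMs are implemented. The helper names (`tokenize`, `tokenize_sentence`) and the extra common words in the vocabulary are hypothetical and added just to make the example runnable.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of one word into vocabulary entries (toy example)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def tokenize_sentence(sentence, vocab):
    """Lowercase, split on whitespace, and tokenize each word."""
    return [tok for word in sentence.lower().split() for tok in tokenize(word, vocab)]


# Assumed filler words so the whole sentence is covered by both vocabularies.
common = {"i", "used", "my", "at", "the"}
small_vocab = common | {"note", "book", "air", "port"}
large_vocab = small_vocab | {"notebook", "airport"}

sentence = "I used my notebook at the airport"
small_tokens = tokenize_sentence(sentence, small_vocab)
large_tokens = tokenize_sentence(sentence, large_vocab)

print(len(small_tokens), small_tokens)  # 9 tokens: "note" + "book", "air" + "port"
print(len(large_tokens), large_tokens)  # 7 tokens: "notebook" and "airport" are single tokens
```

With the larger vocabulary, the same sentence takes 7 tokens instead of 9, so more text fits into a fixed token budget.
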
Sep 24 · Liked by Sebastian Raschka, PhD

Sebastian has explained the topic nicely in his comment, but the last part of this sentence in the article does seem off. It says "so the LLM can fit more tokens into the same input", which sounds like more tokens are needed for the same input, which runs counter to the point being made. It should say "so the LLM needs fewer tokens to fit the same input".

author

Good callout. I probably meant to say "so the LLM can fit more text into the same input", not "so the LLM can fit more tokens into the same input".

Aug 22 · Liked by Sebastian Raschka, PhD

"In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size). However, in section 3.4.3, they state that the annealing was done only on 40 million tokens (0.1% of the annealing data)."

I think the original paper didn't mention how many tokens they used for annealing. They only mentioned using 40B tokens of annealing data for experiments on the quality of different datasets.

author

Thanks for the feedback. I think you are right, the annealing on the 40B dataset was only done to assess data quality before the actual annealing. I reworded it to

> In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size); this 40B annealing dataset was used to assess data quality. In section 3.4.3, they state that the actual annealing was done only on 40 million tokens (0.1% of the annealing data).

Thanks!

Aug 20 · Liked by Sebastian Raschka, PhD

Thanks for this great summary!

About the figures, I think it would be great if you colored the check marks green and the cross marks red. Also, the cross marks (X) can be misleading, since they can also be interpreted as a tick, like in a form. I think it would be better to use a "deny" symbol (the circle with the diagonal band in the center). For me, it was hard to understand the figures because of the ambiguous symbols and the contradictory color assignments.

author

Thanks for the feedback! In hindsight, I probably should have used "Yes" and "No" instead to make it clearer.

Aug 20 · Liked by Sebastian Raschka, PhD

Thanks for such an insightful and detailed overview, Sebastian!

As someone working with research participants, this article really highlights how crucial diverse and high-quality data is in shaping the future of AI.

Your breakdown of the pre-training and post-training methodologies, especially the focus on data filtering and human feedback, provides valuable context for how we can contribute to improving these models.

Truly appreciate your work and look forward to more of your content!

author

Thanks so much!

Aug 19 · Liked by Sebastian Raschka, PhD

Great article! Thanks!!

You mentioned that, for some reason, Qwen is less popular than other open-weight models. Well, I hadn't been following Qwen's progress closely. At least until now :). The main reason is that I’m more interested in multilingual models (I speak Portuguese). Qwen's previous versions were more focused on Chinese and English. After reading your article, I visited the Qwen2 blog (https://qwenlm.github.io/blog/qwen2/) and found this: "Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum, beyond English and Chinese, to bolster its multilingual competencies. Although large language models possess an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training."

That's great!

One last comment: your article includes this passage: "Specifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable 3-billion-parameter server model." However, I think the AFM server model does not have 3B parameters. It seems to be a larger model.

author

Thanks so much for the comment. Glad to hear that the Qwen 2 model may come in handy! Regarding the AFM server model: great point, I am not sure why I wrote 3B there; I updated it to "and a more capable server model of unspecified size."

Aug 19 · Liked by Sebastian Raschka, PhD

Minor edit suggestion: "(If you want to learn how DPO works, I recently implemented it from scratch here.)" The URL is missing from this sentence.

author

Thanks! Just went ahead and inserted the missing link. For easy reference: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb

LLM still means Master of Laws to me, i.e., the degree after the bachelor's!

author

Ha, I had no idea that this abbreviation existed and just learned something new :).

😄 Every day's a school day!

But that's LL. M., not LLM

🤣 thanks for clarifying
