Amazing summary, thank you!
Just a quick question regarding the Qwen 2 training. I read in the report: "Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities."
=> Does that mean there is some QA-format data in there (more than a simple quality stage)?
Thanks for the kind note! You are absolutely right, there was this sentence at the end of the "Distribution Improvement" paragraph that I totally overlooked. I updated the article! Thanks!
This deep dive into LLM pre-training and post-training paradigms is fascinating. It's amazing to see how much the field has evolved with different models like Qwen 2, Apple's AFM, and Llama. Definitely learned a lot—thanks for sharing this! 🙏
Thanks Sebastian!
Don't mean to nit, but: "as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so the LLM can fit more tokens into the same input"
I don't think this is worded correctly -- I think you're talking about a situation where a sentence is tokenized with vocab size |V| into N tokens, and the same sentence _might_ be tokenized with vocab size |2V| into N/2 tokens, and so you can fit twice the amount of _text_ into the same context window, right?
So it's not that the LLM can fit more tokens into the same input by increasing vocab size. It's got a context window that stays fixed when you change vocab size.
Am I interpreting you right? I know you know this, just an unasked-for tip :)
Thanks for the kind comment! It's actually helpful to know that this is still confusing... Maybe changing "tokens" to "text" will help clarify? I.e.,
"as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so we can fit more text into the same input"
Ah, I see other people have brought this up too, now that I look at the comments.
I think it's semantically different enough to warrant changing! People like me are continuing to read your great articles months after they're published, and I'm sure you agree that novices deserve specific explanations :P
Thanks for sharing!
Amazing post! Thank you so much. It would also be great to have it updated with the Qwen 2.5 information.
Thanks! If I understand it correctly though, there is no Qwen 2.5 paper (yet), only papers for their specialized models:
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, https://arxiv.org/abs/2409.12122
- Qwen2.5-Coder Technical Report, https://arxiv.org/abs/2409.12186
But yeah, maybe interesting for a follow-up article some time :)
Thanks for an excellent overview! Can you explain "as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so the LLM can fit more tokens into the same input"? This isn't intuitive to me.
Good question. That's because the tokenizer's vocabulary contains more unique words (and word pieces). E.g., as a simple illustration, a small vocabulary may be:
- note
- book
- air
- port
and with a bigger vocabulary you might have
- note
- book
- notebook
- air
- port
- airport
etc.
I.e., if you have a sentence like "I used my notebook at the airport", a tokenizer with the small vocabulary would produce more tokens (here, 2 tokens each for "notebook" and "airport" instead of 1 token each).
Sebastian has explained the topic nicely in his comment, but the last part of this sentence in the article does seem off. It says "so the LLM can fit more tokens into the same input", which sounds like more tokens are needed for the same input, which is counter to the point being made. It should say "so the LLM needs fewer tokens to fit the same input".
Good call out. I probably meant to say "so the LLM can fit more text into the same input" not "so the LLM can fit more tokens into the same input"
"In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size). However, in section 3.4.3, they state that the annealing was done only on 40 million tokens (0.1% of the annealing data)."
I think the original paper didn't mention how many tokens they used for annealing. They just mentioned using the 40B annealing dataset for experiments on different dataset qualities.
Thanks for the feedback. I think you are right, the annealing on the 40B dataset was only done to assess data quality before the actual annealing. I reworded it to
> In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size); this 40B annealing dataset was used to assess data quality. In section 3.4.3, they state that the actual annealing was done only on 40 million tokens (0.1% of the annealing data).
Thanks!
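(Side note for readers unfamiliar with the term: "annealing" here refers to the final stretch of pre-training, where the learning rate is decayed toward zero while training on a small, high-quality data subset. A generic sketch of a linear decay schedule, in plain Python with made-up names and not the exact schedule from the paper:)

```python
def annealed_lr(step, total_annealing_steps, peak_lr):
    # Linearly decay the learning rate from peak_lr to 0 over the
    # annealing phase (illustrative only).
    assert 0 <= step <= total_annealing_steps
    return peak_lr * (1 - step / total_annealing_steps)

# e.g., halfway through annealing, the learning rate is half the peak value:
print(annealed_lr(step=500, total_annealing_steps=1000, peak_lr=3e-4))  # 0.00015
```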
Thanks for this great summary!
About the figures, I think it would be great if you had colored the check marks in green and the cross marks in red. Also, the cross marks (X) can be misleading; they can also be interpreted as a tick, like in a form. I think it would be better to use a "deny" symbol (a circle with a diagonal band through the center). For me, it was hard to understand the figures because of the ambiguous symbols and the contradictory color assignments.
Thanks for the feedback! In hindsight, I probably should have used "Yes" and "No" instead to make it more clear.
Thanks for such an insightful and detailed overview, Sebastian!
As someone working with research participants, this article really highlights how crucial diverse and high-quality data is in shaping the future of AI.
Your breakdown of the pre-training and post-training methodologies, especially the focus on data filtering and human feedback, provides valuable context for how we can contribute to improving these models.
Truly appreciate your work and look forward to more of your content!
Thanks so much!
Great article! Thanks!!
You mentioned that, for some reason, Qwen is less popular than other open-weight models. Well, I hadn't been following Qwen's progress closely. At least until now :). The main reason is that I’m more interested in multilingual models (I speak Portuguese). Qwen's previous versions were more focused on Chinese and English. After reading your article, I visited the Qwen2 blog (https://qwenlm.github.io/blog/qwen2/) and found this: "Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum, beyond English and Chinese, to bolster its multilingual competencies. Although large language models possess an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training."
That's great!
One last comment: your article includes this passage: "Specifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable 3-billion-parameter server model." However, I think the AFM server model does not have 3B parameters. It seems to be a larger model.
Thanks so much for the comment. Glad to hear that the Qwen 2 model may come in handy! Regarding the AFM server model: great point, I am not sure why I wrote 3B there; I updated it to "and a more capable server model of unspecified size."
Minor edit suggestion: "(If you want to learn how DPO works, I recently implemented it from scratch here.)" - the URL link is missing from this sentence.
Thanks! Just went ahead and inserted the missing link. For easy reference: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
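For readers who don't want to click through, here is a minimal sketch of the core DPO loss, assuming you already have the per-example summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the function and variable names are illustrative, not taken from the linked notebook:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO pushes the policy's log-prob margin between the chosen and the
    # rejected response above the frozen reference model's margin.
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

# Toy usage with made-up log-probabilities for a batch of two examples:
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -11.0]),
    ref_chosen_logps=torch.tensor([-12.5, -10.0]),
    ref_rejected_logps=torch.tensor([-13.5, -10.5]),
)
print(loss)  # a scalar tensor; lower is better for the policy
```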
LLM still means Master of Laws to me, i.e. the degree after the bachelors!
Ha, I had no idea that this abbreviation existed and just learned something new :).
😄 Every day's a school day!
But that's LL.M., not LLM
🤣 thanks for clarifying