New LLM Pre-training and Post-training…

Sebastian Raschka, PhD

Aug 17, 2024

A Look at How Modern LLMs Are Trained

29 Comments

Bufort

Aug 17

Amazing summary, thank you!

Just a quick question regarding the Qwen 2 training.

I read in the report:

"Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities."

=> So that means there is some QA-formatted data in pre-training, no? (More than a simple quality stage.)

Sebastian Raschka, PhD

Aug 17

Thanks for the kind note! You are absolutely right, there was this sentence at the end of the "Distribution Improvement" paragraph that I totally overlooked. I updated the article! Thanks!

Maria Mouschoutzi

Aug 26

This deep dive into LLM pre-training and post-training paradigms is fascinating. It's amazing to see how much the field has evolved with different models like Qwen 2, Apple's AFM, and Llama. Definitely learned a lot—thanks for sharing this! 🙏

Sam

Nov 26

Thanks Sebastian!

Don't mean to nit, but: "as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so the LLM can fit more tokens into the same input"

I don't think this is worded correctly -- I think you're talking about a situation where a sentence is tokenized with vocab size |V| into N tokens, and the same sentence _might_ be tokenized with vocab size 2|V| into N/2 tokens, and so you can fit twice the amount of _text_ into the same context window, right?

So it's not that the LLM can fit more tokens into the same input by increasing vocab size. It's got a context window that stays fixed when you change vocab size.

Am I interpreting you right? I know you know this, just an unasked-for tip :)
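
A minimal sketch of the distinction above, under idealized assumptions: the context window is a fixed token budget, and only the amount of text per token changes with the vocabulary. The window size and the characters-per-token figures are hypothetical numbers chosen to mirror the 2x rule of thumb, not measurements from any real tokenizer.

```python
def text_capacity(context_window_tokens, chars_per_token):
    """Characters of text that fit into a fixed token budget."""
    return context_window_tokens * chars_per_token

# Hypothetical values: the window is the same for both tokenizers.
context_window = 4096

# Assumed averages: the larger vocabulary packs twice as much text per token.
capacity_small_vocab = text_capacity(context_window, chars_per_token=2.0)
capacity_large_vocab = text_capacity(context_window, chars_per_token=4.0)

print(capacity_small_vocab, capacity_large_vocab)
```

The token budget is identical in both calls; only the text capacity doubles.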

Sebastian Raschka, PhD

Nov 26

Thanks for the kind comment! It's actually helpful to know that this is still confusing... Maybe changing "tokens" to "text" will help clarify? I.e.,

"as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so we can fit more text into the same input"

Sam

Nov 26

Ah, I see other people have brought this up too, now that I look at the comments.

I think it's semantically different enough to warrant changing! People like me are continuing to read your great articles months after they're published, and I'm sure you agree that novices deserve specific explanations :P

Brian

Nov 20

Thanks for sharing!

Andrey Cheptsov

Sep 23

Amazing post! Thank you so much. It would also be great to have it updated with the Qwen 2.5 information.

Sebastian Raschka, PhD

Sep 23

Thanks! If I understand it correctly, though, there is no Qwen 2.5 paper (yet), only papers for their specialized models:

- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, https://arxiv.org/abs/2409.12122

- Qwen2.5-Coder Technical Report, https://arxiv.org/abs/2409.12186

But yeah, maybe interesting for a follow-up article some time :)

Logan Thorneloe

Sep 6

Thanks for an excellent overview! Can you explain "as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so the LLM can fit more tokens into the same input"? This isn't intuitive to me.

Sebastian Raschka, PhD

Sep 7

Good question. That's because a bigger vocabulary contains more unique words. E.g., as a simple illustration, a small vocabulary may be:

- note

- book

- air

- port

and with a bigger vocabulary you might have

- note

- book

- notebook

- air

- port

- airport

etc.

I.e., if you have the sentence

"I used my notebook at the airport",

a tokenizer with a small vocabulary would produce more tokens (here, 2 tokens each for "notebook" and "airport" instead of 1 token each).
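
The notebook/airport illustration can be sketched in code. This is only a toy: it uses a greedy longest-match tokenizer (a simplification, not how BPE or any particular LLM tokenizer works), and the function name `tokenize` and the extra common words added to both vocabularies are assumptions made so the example sentence can be fully tokenized.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization of each whitespace-separated word."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest vocabulary entry that matches at this position.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # Nothing matched: fall back to a single character.
                tokens.append(word[start])
                start += 1
    return tokens

# Hypothetical shared entries so the whole sentence is covered.
common = {"i", "used", "my", "at", "the"}
small_vocab = common | {"note", "book", "air", "port"}
large_vocab = small_vocab | {"notebook", "airport"}

sentence = "I used my notebook at the airport"
small_tokens = tokenize(sentence, small_vocab)
large_tokens = tokenize(sentence, large_vocab)

print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

With the small vocabulary, "notebook" and "airport" each split into two pieces, so the same sentence costs two more tokens than with the large vocabulary.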

Pradeep G

Sep 24

Sebastian has explained the topic nicely in his comment, but the last part of this sentence in the article does seem off. It says "so the LLM can fit more tokens into the same input", which sounds like more tokens are needed for the same input, and that runs counter to the point being made. It should say "so the LLM needs fewer tokens to fit the same input".

Sebastian Raschka, PhD

Sep 25

Good callout. I probably meant to say "so the LLM can fit more text into the same input", not "so the LLM can fit more tokens into the same input".

AlphaSue

Aug 22

"In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size). However, in section 3.4.3, they state that the annealing was done only on 40 million tokens (0.1% of the annealing data)."

I think the original paper didn't mention how many tokens they used for annealing. They only mentioned using 40B tokens of annealing data in an experiment comparing the quality of different datasets.

Sebastian Raschka, PhD

Aug 22

Thanks for the feedback. I think you are right, the annealing on the 40B dataset was only done to assess data quality before the actual annealing. I reworded it to

> In section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size); this 40B annealing dataset was used to assess data quality. In section 3.4.3, they state that the actual annealing was done only on 40 million tokens (0.1% of the annealing data).

Thanks!

Daniel Kleine

Aug 20

Thanks for this great summary!

About the figures, I think it would be great if you colored the check marks green and the cross marks red. Also, the cross marks (X) can be misleading, since they can also be interpreted as a tick, like in a form. I think it would be better to use a "deny" symbol (the circle with the diagonal band in the center). For me, it was hard to understand the figures because of the ambiguous symbols and the contradictory color assignments used.

Sebastian Raschka, PhD

Aug 22

Thanks for the feedback! In hindsight, I probably should have used "Yes" and "No" instead to make it more clear.

Bradley

Aug 20

Thanks for such an insightful and detailed overview, Sebastian!

As someone working with research participants, this article really highlights how crucial diverse and high-quality data is in shaping the future of AI.

Your breakdown of the pre-training and post-training methodologies, especially the focus on data filtering and human feedback, provides valuable context for how we can contribute to improving these models.

Truly appreciate your work and look forward to more of your content!

Sebastian Raschka, PhD

Aug 22

Thanks so much!

Luis Pessoa

Aug 19

Great article! Thanks!!

You mentioned that, for some reason, Qwen is less popular than other open-weight models. Well, I hadn't been following Qwen's progress closely. At least until now :). The main reason is that I’m more interested in multilingual models (I speak Portuguese). Qwen's previous versions were more focused on Chinese and English. After reading your article, I visited the Qwen2 blog (https://qwenlm.github.io/blog/qwen2/) and found this: "Significant efforts were directed towards augmenting both the volume and quality of pretraining and instruction-tuning datasets across a diverse linguistic spectrum, beyond English and Chinese, to bolster its multilingual competencies. Although large language models possess an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training."

That's great!

One last comment: your article includes this passage: "Specifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable 3-billion-parameter server model." However, I think the AFM server model does not have 3B parameters. It seems to be a larger model.

Sebastian Raschka, PhD

Aug 19

Thanks so much for the comment. Glad to hear that the Qwen 2 model may come in handy! Regarding the AFM server model: great point, I am not sure why I wrote 3B there; I updated it to "and a more capable server model of unspecified size."

Sowmya

Aug 19

Minor edit suggestion: "(If you want to learn how DPO works, I recently implemented it from scratch here.)" - the URL link is missing here.

Sebastian Raschka, PhD

Aug 19

Thanks! Just went ahead and inserted the missing link. For easy reference: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb

Dr Nia D Thomas

Aug 19

LLM still means Master of Laws to me, i.e., the degree after the bachelor's!

Sebastian Raschka, PhD

Aug 19

Ha, I had no idea that this abbreviation existed and just learned something new :).

Dr Nia D Thomas

Aug 19

😄 Every day's a school day!

Daniel Kleine

Aug 20

But that's LL. M., not LLM

Dr Nia D Thomas

Aug 20

🤣 thanks for clarifying

Animesh Bote

Jun 16

"as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x so we can fit more text into the same input." <- If I increase the vocab size, shouldn't that increase the number of input tokens instead of reducing it? Am I missing any context here?
