I’m going to run Magpie on Nemotron 340B. Should’ve thought of this earlier! Then we get permissive IFT datasets similar to what they paid for (e.g., HelpSteer 2).
Let me know if & how that works! (Not sure how well it generalizes to other architectures)
Needed to start a bug bounty on the implementation.... https://x.com/natolambert/status/1814735390877884823
Instead of Huggingface, I think llama.cpp / Ollama would be even better :)
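For anyone who wants to try this locally, the core Magpie trick is simple: you send an aligned model only the pre-query part of its chat template, and it autocompletes a plausible user instruction; you then wrap that instruction in the full template and generate a response. Below is a minimal sketch against a local Ollama server, assuming a Llama-3-style template and "llama3" as the model name (Nemotron 340B uses its own template and needs far heavier hardware, so treat this as an illustration only):

```python
# Minimal sketch of the Magpie idea using Ollama's REST API.
# Assumptions: Ollama running on the default port, a Llama-3-style
# chat template, and "llama3" as the model name.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def ollama_raw(prompt, stop=None):
    # raw=True bypasses Ollama's own prompt templating so we control the template.
    payload = {
        "model": "llama3",
        "prompt": prompt,
        "raw": True,
        "stream": False,
        "options": {"stop": stop or []},
    }
    return requests.post(OLLAMA_URL, json=payload).json()["response"]

# Step 1: send only the pre-query template; the aligned model completes it
# with a plausible user instruction, since that is what normally follows.
instruction = ollama_raw(PRE_QUERY, stop=["<|eot_id|>"]).strip()

# Step 2: wrap the sampled instruction in the full template and generate a response.
full_prompt = (
    PRE_QUERY + instruction + "<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
response = ollama_raw(full_prompt, stop=["<|eot_id|>"]).strip()

print({"instruction": instruction, "response": response})
```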
Seems related to OpenAI’s new job post on “mid training”: https://openai.com/careers/research-scientist-mid-training/
Is mid-training just instruction fine-tuning?
It’s messier
Thanks for the reply! Could you share papers talking about it? I can't find much about mid-training.
I should write about it, not clear
Thanks so much for sharing our instruction pre-training work!!! 💗 We would like to clarify that although instruction pre-training augments the raw corpora with some instruction data, we always compare models pre-trained with the same number of tokens (i.e., Instruction PT sees the same number of tokens as Vanilla PT). You may refer to this figure to see the performance trend: [Performance Trend](https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png).
Thanks for the comment! I updated the comment about the dataset sizes right away:
> From this comparison, we can see that it is not simply using any instruction data that makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) is what makes the difference. (The authors conducted all experiments using the same number of tokens.)
Thanks!!!
Seb, a quick question.
So in 'Instruction Pretraining LLMs' (i.e., InstructPT), what we're essentially doing is not masking the prompt tokens, right? I guess this is the default in Hugging Face's trl (where the entire instruction and output tokens are used), unless we use a data collator. I'm also about to read about SFT, but if you could give a spoiler: is masking the prompt standard practice, or is training only on completions what is actually called SFT?
Hi there. That's a good question. Personally, I don't use the Hugging Face trl library and thus haven't looked at their source code. But in general, whether or not to mask the prompt tokens is a hyperparameter: one setting is not universally better than the other.
I think masking is fairly common, but whether it adds benefits depends a bit on the dataset size and lengths. In my book (http://mng.bz/orYv), I have the instruction masking part as an exercise in Chapter 7, for example.
There was also a recent research paper where they did an empirical comparison and found that instruction masking can actually be worse. I discussed it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction
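To make the two settings concrete, here is a minimal sketch of how prompt masking is typically implemented (not the book's or trl's code, just an illustration): the labels are a copy of the input ids, with the prompt positions overwritten by -100 so that PyTorch's cross-entropy loss ignores them.

```python
# Minimal sketch of prompt/instruction masking for SFT, assuming a Hugging
# Face-style tokenizer and the usual convention that label id -100 is
# skipped by the loss (PyTorch's default ignore_index).
import torch

def build_example(tokenizer, prompt, response, ignore_index=-100):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # Without masking: labels == input_ids (loss on prompt and response tokens).
    # With masking: prompt positions are set to -100, so the loss is
    # computed only on the response (completion) tokens.
    labels = [ignore_index] * len(prompt_ids) + response_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```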
Thanks a lot, Seb, for your reply! I can't say this enough, but replies and responses like these motivate many of us, and we learn a lot. Thanks again, onto chapter 7 now.
The link to the notebook for generating an instruction dataset ("Here") seems broken, as it's pointing to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb. I think it's supposed to point to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb.
Thanks! Should be fixed now!