17 Comments
Nathan Lambert:

I’m going to run Magpie on Nemotron-340B. Should’ve thought of this earlier! Then we get permissive IFT datasets similar to what they paid for (e.g., HelpSteer2).
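For context, Magpie extracts instructions by feeding an aligned chat model only the pre-query part of its chat template (everything up to where the user message would begin), so the model generates a plausible user instruction on its own; a second pass then answers it. A minimal sketch of the control flow, assuming a Llama-3-style template (`generate_fn` is a hypothetical stand-in for a real model call):

```python
# Sketch of Magpie-style instruction extraction (not a full implementation).
# Idea: given only the bare pre-query chat template, an aligned model's
# completion can be treated as a synthetic user instruction.

# Llama-3-style pre-query template: it ends exactly where the user's
# message would begin, so the model "fills in" an instruction.
PRE_QUERY_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
)

def extract_instructions(generate_fn, n=3):
    """generate_fn is a hypothetical model call: prompt str -> completion str."""
    instructions = []
    for _ in range(n):
        # The model's continuation of the bare template is the instruction.
        instructions.append(generate_fn(PRE_QUERY_TEMPLATE).strip())
    return instructions

# Toy stand-in model, just to show the flow; a real run would sample
# from the actual LLM with temperature > 0 to get diverse instructions.
fake_model = lambda prompt: "Explain the difference between SFT and RLHF."
print(extract_instructions(fake_model, n=1))
```

Whether this transfers to Nemotron-340B presumably depends on how closely its chat template and alignment behave like the models Magpie was demonstrated on.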

Sebastian Raschka, PhD:

Let me know if & how that works! (Not sure how well it generalizes to other architectures)

Nathan Lambert:

Needed to start a bug bounty on the implementation.... https://x.com/natolambert/status/1814735390877884823

Sebastian Raschka, PhD:

Instead of Hugging Face, I think llama.cpp / Ollama would be even better :)

Nathan Lambert:

Seems related to OpenAI's new job post on “mid-training”: https://openai.com/careers/research-scientist-mid-training/

Benedict Neo:

Is mid-training just instruction fine-tuning?

Nathan Lambert:

It’s messier

Benedict Neo:

Thanks for the reply! Could you share papers talking about it? I can't find much about mid-training.

Nathan Lambert:

I should write about it, not clear

Daixuan Cheng:

Thanks so much for sharing our instruction pre-training work!!! 💗 We would like to clarify that although instruction pre-training augments the raw corpora with some instruction data, we always compare models pre-trained with the same number of tokens (i.e., Instruction PT sees the same number of tokens as Vanilla PT). You may refer to this figure to see the performance trend: [Performance Trend](https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png).

Sebastian Raschka, PhD:

Thanks for the comment! I updated the comment about the dataset sizes right away:

> From this comparison, we can see that not simply using any instruction data makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) makes the difference. (The authors conducted all experiments using the same number of tokens.)

Daixuan Cheng:

Thanks!!!

Iqbal Singh:

Seb, a quick question.

So in 'Instruction Pretraining LLMs' (i.e., InstructPT), what we're essentially doing is not masking the prompt tokens, right? I guess this is the default in Hugging Face's trl (where the entire instruction and output tokens are used), unless we use a data collator. I'm also about to read about SFT, but if you could give a spoiler: is masking the prompt standard practice, or is training only on completions what is actually called SFT?

Sebastian Raschka, PhD:

Hi there. That's a good question. Personally, I don't use the Hugging Face trl library and thus haven't looked at their source code. But in general, whether or not to mask the prompt tokens is a hyperparameter: one setting is not universally better than the other.

I think masking is fairly common, but whether it adds benefits depends a bit on the dataset size and lengths. In my book (http://mng.bz/orYv), I have the instruction masking part as an exercise in Chapter 7, for example.

There was also a recent research paper where they did an empirical comparison and found that instruction masking can actually be worse. I discussed it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction
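To make the masking question above concrete, here is a minimal sketch of how prompt masking is commonly implemented: the labels for the prompt tokens are set to an ignore index (−100, which PyTorch's cross-entropy loss ignores by default), so only the response tokens contribute to the loss. The token IDs are made up purely for illustration:

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label by default

def build_labels(prompt_ids, response_ids, mask_prompt=True):
    """Return (input_ids, labels) for one instruction-response pair.

    With mask_prompt=True, only response tokens contribute to the loss;
    with mask_prompt=False, the model is trained on the full sequence.
    """
    input_ids = prompt_ids + response_ids
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    else:
        labels = list(input_ids)
    return input_ids, labels

# Made-up token IDs, just to contrast the two settings:
prompt, response = [10, 11, 12], [20, 21]
print(build_labels(prompt, response, mask_prompt=True))
# ([10, 11, 12, 20, 21], [-100, -100, -100, 20, 21])
print(build_labels(prompt, response, mask_prompt=False))
# ([10, 11, 12, 20, 21], [10, 11, 12, 20, 21])
```

Either way the model sees the full sequence as input; masking only changes which positions the loss is computed on.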

Iqbal Singh:

Thanks a lot, Seb, for your reply! I can't say this enough, but replies and responses like these motivate many of us, and we learn a lot. Thanks again, onto chapter 7 now.

Manu:

The link to the notebook for generating an instruction dataset ("Here") seems broken, as it's pointing to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb. I think it's supposed to point to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb.

Sebastian Raschka, PhD:

Thanks! Should be fixed now!
