17 Comments

I’m going to run Magpie on Nemotron 340B. Should’ve thought of this earlier! Then we get permissive IFT datasets similar to what they paid for (e.g., HelpSteer2).

Let me know if & how that works! (Not sure how well it generalizes to other architectures)

Needed to start a bug bounty on the implementation.... https://x.com/natolambert/status/1814735390877884823

Instead of Huggingface, I think llama.cpp / Ollama would be even better :)

Seems related to OpenAI's new job post on “mid training”: https://openai.com/careers/research-scientist-mid-training/

Is mid training just instruction fine-tuning?

It’s messier

thanks for the reply! could you share papers talking about it? i can't find much about mid-training.

I should write about it; the term isn't clearly defined.

Thanks so much for sharing our instruction pre-training work!!! 💗 We would like to clarify that although instruction pre-training augments the raw corpora with some instruction data, we always compare models pre-trained with the same number of tokens (i.e., Instruction PT sees the same number of tokens as Vanilla PT). You may refer to this figure to see the performance trend: [Performance Trend](https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png).

Thanks for the comment! I updated the comment about the dataset sizes right away:

> From this comparison, we can see that not simply using any instruction data makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) makes the difference. (The authors conducted all experiments using the same number of tokens.)

Thanks!!!

Seb, a quick question.

So in 'Instruction Pretraining LLMs' (i.e., InstructPT), what we're essentially doing is not masking the prompt tokens, right? I guess this is the default in Hugging Face's trl (where the entire instruction and output tokens are used), unless we use a data collator. I'm also about to read about SFT, but if you could give a spoiler: is masking the prompt standard practice, or is training only on completions what is actually called SFT?

Hi there. That's a good question. Personally, I don't use the Hugging Face trl library and thus haven't looked at their source code. But in general, whether or not to mask the prompt tokens is a hyperparameter: one setting is not universally better than the other.

I think masking is fairly common, but whether it adds benefits depends a bit on the dataset size and lengths. In my book (http://mng.bz/orYv), I have the instruction masking part as an exercise in Chapter 7, for example.

There was also a recent research paper where they did an empirical comparison and found that instruction masking can actually be worse. I discussed it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction
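
To make the two settings concrete, here is a minimal sketch of how the training labels could be built with and without prompt masking. The function name `build_labels` and the token IDs are hypothetical and only for illustration; the value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the loss. The one-position shift for next-token prediction is assumed to happen elsewhere in the training loop.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, response_ids, mask_prompt=True):
    """Return (input_ids, labels) for one training example.

    Hypothetical helper: with mask_prompt=True, the prompt positions are
    excluded from the loss; with mask_prompt=False, the model is trained
    on both the prompt and the response tokens.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Loss is computed only on the response tokens
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # Loss is computed on every token
        labels = list(input_ids)
    return input_ids, labels


# Made-up token IDs for illustration
prompt_ids = [11, 12, 13]
response_ids = [21, 22]

masked = build_labels(prompt_ids, response_ids, mask_prompt=True)
unmasked = build_labels(prompt_ids, response_ids, mask_prompt=False)
```

Treating `mask_prompt` as a flag makes it easy to try both settings on the same dataset and compare.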

Thanks a lot, Seb, for your reply! I can't say this enough, but replies and responses like these motivate many of us, and we learn a lot. Thanks again, onto chapter 7 now.

The link to the notebook for generating the instruction dataset ("Here") seems broken, as it's pointing to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb. I think it's supposed to point to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb

Thanks! Should be fixed now!
