I’m going to run Magpie on Nemotron 340B. Should’ve thought of this earlier! Then we get permissive IFT datasets similar to what they paid for (e.g., HelpSteer 2).
Let me know if & how that works! (Not sure how well it generalizes to other architectures)
Needed to start a bug bounty on the implementation.... https://x.com/natolambert/status/1814735390877884823
Instead of Huggingface, I think llama.cpp / Ollama would be even better :)
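For anyone who wants to try this locally, the core Magpie trick is simple: you send an aligned model only the pre-query part of its chat template, and it autocompletes a plausible user instruction; you then wrap that instruction in the full template and generate a response. Below is a minimal sketch against a local Ollama server, assuming a Llama-3-style template and "llama3" as the model name (Nemotron 340B uses its own template and needs far heavier hardware, so treat this as an illustration only):

```python
# Minimal sketch of the Magpie idea using Ollama's REST API.
# Assumptions: Ollama running on the default port, a Llama-3-style
# chat template, and "llama3" as the model name.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def ollama_raw(prompt, stop=None):
    # raw=True bypasses Ollama's own prompt templating so we control the template.
    payload = {
        "model": "llama3",
        "prompt": prompt,
        "raw": True,
        "stream": False,
        "options": {"stop": stop or []},
    }
    return requests.post(OLLAMA_URL, json=payload).json()["response"]

# Step 1: send only the pre-query template; the aligned model completes it
# with a plausible user instruction, since that is what normally follows.
instruction = ollama_raw(PRE_QUERY, stop=["<|eot_id|>"]).strip()

# Step 2: wrap the sampled instruction in the full template and generate a response.
full_prompt = (
    PRE_QUERY + instruction + "<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
response = ollama_raw(full_prompt, stop=["<|eot_id|>"]).strip()

print({"instruction": instruction, "response": response})
```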
Seems related to OpenAI’s new job post on “mid training”: https://openai.com/careers/research-scientist-mid-training/
Is mid-training just instruction fine-tuning?
It’s messier
Thanks for the reply! Could you share papers talking about it? I can't find much about mid-training.
I should write about it, not clear
Thanks so much for sharing our instruction pre-training work!!! 💗 We would like to clarify that although instruction pre-training augments the raw corpora with some instruction data, we always compare models pre-trained with the same number of tokens (i.e., Instruction PT sees the same number of tokens as Vanilla PT). You may refer to this figure to see the performance trend: [Performance Trend](https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png).
Thanks for the comment! I updated the comment about the dataset sizes right away:
> From this comparison, we can see that it is not simply using any instruction data that makes the difference. The fact that Instruct PT performs better than Mix PT on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) is what makes the difference. (The authors conducted all experiments using the same number of tokens.)
Thanks!!!
Seb, a quick question.
So in 'Instruction Pretraining LLMs' (i.e., InstructPT), what we're essentially doing is not masking the prompt tokens, right? I guess this is the default in Hugging Face's trl (where the entire instruction and output tokens are used), unless we use a data collator. I'm also about to read about SFT, but if you could give a spoiler: is masking the prompt standard practice, or is training only on completions what is actually called SFT?
Hi there. That's a good question. Personally, I don't use the Hugging Face trl library and thus haven't looked at their source code. But in general, whether or not to mask the prompt tokens is a hyperparameter: one setting is not universally better than the other.
I think masking is fairly common, but whether it adds benefits depends a bit on the dataset size and lengths. In my book (http://mng.bz/orYv), I have the instruction masking part as an exercise in Chapter 7, for example.
There was also a recent research paper where they did an empirical comparison and found that instruction masking can actually be worse. I discussed it here: https://magazine.sebastianraschka.com/p/llm-research-insights-instruction
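To make the two settings concrete, here is a minimal sketch of how prompt masking is typically implemented (not the book's or trl's code, just an illustration): the labels are a copy of the input ids, with the prompt positions overwritten by -100 so that PyTorch's cross-entropy loss ignores them.

```python
# Minimal sketch of prompt/instruction masking for SFT, assuming a Hugging
# Face-style tokenizer and the usual convention that label id -100 is
# skipped by the loss (PyTorch's default ignore_index).
import torch

def build_example(tokenizer, prompt, response, ignore_index=-100):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # Without masking: labels == input_ids (loss on prompt and response tokens).
    # With masking: prompt positions are set to -100, so the loss is
    # computed only on the response (completion) tokens.
    labels = [ignore_index] * len(prompt_ids) + response_ids
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```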
Thanks a lot, Seb, for your reply! I can't say this enough, but replies and responses like these motivate many of us, and we learn a lot. Thanks again, onto chapter 7 now.
The link to the notebook for generating an instruction dataset ("Here") seems broken, as it's pointing to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb. I think it's supposed to point to https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb.
Thanks! Should be fixed now!