Thanks so much for sharing our instruction pre-training work!!! 💗 We would like to clarify that although instruction pre-training augments the raw corpora with some instruction data, we always compare models pre-trained with the same number of tokens (i.e., Instruction PT sees the same number of tokens as Vanilla PT). You may refer to this figure to see the performance trend: [Performance Trend](https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png).
Thanks for the comment! I updated the comment about the dataset sizes right away:
> From this comparison, we can see that it is not simply using any instruction data that makes the difference. The fact that Instruct PT outperforms Mix PT on most tasks shows that the nature of the instruction-response data (i.e., instruction-response pairs related to the raw data) is what matters. (The authors conducted all experiments with the same number of tokens.)
Thanks!!!