Hi Sebastian!
re: instruction tuning, I'm wondering if it would ever make sense to either
1. weight the loss function to consider output tokens *more* than instruction tokens
2. train on the full task for an epoch (or thereabouts) and then on only the output
I do think that this weighting would make sense. It would basically be an in-between of masking and not masking. It's also easy to implement since the cross entropy loss in PyTorch has a `weight` attribute.
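Here's a rough sketch of how such per-token weighting could look in PyTorch (the shapes, vocabulary size, and the 0.5 weight are just illustrative assumptions; note that the `weight` argument of the built-in cross entropy is a per-class weight, so the per-token weights are applied manually via `reduction="none"` here):

```python
import torch
import torch.nn.functional as F

# Toy example: logits for 6 tokens over a 10-token vocabulary; the first 3
# positions belong to the instruction, the last 3 to the response.
torch.manual_seed(123)
logits = torch.randn(6, 10)            # (num_tokens, vocab_size)
targets = torch.randint(0, 10, (6,))   # next-token targets

# The `weight` argument of F.cross_entropy weights classes, not positions, so
# for per-token weighting one option is reduction="none" plus a weight vector.
per_token_loss = F.cross_entropy(logits, targets, reduction="none")

# Down-weight instruction tokens (0.5) relative to response tokens (1.0)
token_weights = torch.tensor([0.5, 0.5, 0.5, 1.0, 1.0, 1.0])
loss = (per_token_loss * token_weights).sum() / token_weights.sum()
```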
Regarding your 2nd point, I would also say yes. Both are valid ideas, but it would require some empirical experiments to see how they work out in practice.
Hi, great post, as always!
You mention that the authors of the instruction tuning paper do not mask the template. However, the paper says: "The loss function, L, for instruction modelling calculates the negative log-likelihood for both instruction and completion tokens, excluding any prompt template tokens." So I am not sure that is true.
Am I missing something?
Thanks for the comment. Yes, that's correct. I mentioned this here: "It's the method that the authors refer to in the paper as 'instruction modeling.' (In the paper, they additionally mask special prompt tokens like <|user|>, <|assistant|>, and <|system|> that may occur in non-Alpaca prompt templates.)"
I.e., they don't mask anything except the special tokens.
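In code, that kind of masking usually amounts to setting the targets of the special tokens to the loss function's ignore index before computing the loss. A minimal sketch (the token IDs are placeholders I made up, not the paper's actual IDs):

```python
import torch
import torch.nn.functional as F

# Exclude special prompt tokens from the loss by setting their targets to -100,
# which F.cross_entropy skips by default (ignore_index=-100).
special_token_ids = {50257, 50258, 50259}  # e.g., <|user|>, <|assistant|>, <|system|>

targets = torch.tensor([50257, 11, 22, 50258, 33, 44, 55])
masked_targets = targets.clone()
for special_id in special_token_ids:
    masked_targets[masked_targets == special_id] = -100

logits = torch.randn(targets.shape[0], 50260)   # (num_tokens, vocab_size)
loss = F.cross_entropy(logits, masked_targets)  # special positions don't contribute
```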
Thanks for the wonderful read, Sebastian.
Do you have any thoughts on whether including the instruction in the loss function makes the task 'continued pretraining' rather than 'instruction finetuning'? I mean, how does this differ from continued pretraining then?
Good question, but it would still be considered instruction finetuning because you calculate the loss over the inputs differently. In continued pretraining, you feed the LLM more text chunks, whereas in instruction finetuning, the next-token prediction is for the whole training example at once. I will explain this more with a hands-on example in the upcoming chapter 7 of my Build an LLM from Scratch book.
Alright, I've got your book and am looking forward to it. But just to see if I understand you correctly: you're saying that if the task were treated as continued pretraining, then each token in the instruction and response would be predicted autoregressively and the loss would be computed for each token. The main difference here is that the loss is computed on the entire instruction and response (considering all of that as a sort of next-token task), right?
Yes, for a text example with n tokens, you compute the loss over n-1 pretraining tasks. In instruction finetuning, you compute the loss only over 1 task in this case. (Of course, you can have multiple training examples, but in 1 epoch, the model only sees each token exactly once in instruction finetuning.)
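To make the n-1 part concrete, here's a tiny sketch with made-up token IDs:

```python
# A text example with n tokens yields n-1 next-token prediction targets,
# because the targets are simply the inputs shifted by one position.
tokens = [5, 42, 7, 99, 3]    # n = 5 tokens (arbitrary placeholder IDs)
inputs = tokens[:-1]          # [5, 42, 7, 99]
targets = tokens[1:]          # [42, 7, 99, 3]  -> n-1 = 4 prediction tasks
```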
Alright, got you 👍