9 Comments
Jul 2 · Liked by Sebastian Raschka, PhD

Hi Sebastian!

re: instruction tuning, I'm wondering if it would ever make sense to either

1. weight the loss function to consider output tokens *more* than instruction tokens

2. train on the full task for an epoch (or thereabouts) and then on only the output

author

I do think that this weighting would make sense. It would basically be an in-between of masking and not masking. It's also easy to implement since the cross entropy loss in PyTorch has a `weight` argument.
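For example, a minimal per-token variant of this idea could look like the following sketch (the tensor shapes, the instruction/output split, and the 0.5 weight are all made up for illustration; in practice, the mask would come from the tokenized prompt):

```python
import torch
import torch.nn.functional as F

# Toy example: logits for 6 token positions over a 50,000-token vocabulary
logits = torch.randn(6, 50_000)
targets = torch.randint(0, 50_000, (6,))

# Suppose the first 3 positions are instruction tokens, the rest output tokens
is_instruction = torch.tensor([True, True, True, False, False, False])

# reduction="none" keeps one loss value per token instead of averaging
per_token_loss = F.cross_entropy(logits, targets, reduction="none")

# Down-weight instruction tokens; 0.0 would equal full masking, 1.0 no masking
weights = torch.ones(targets.shape[0])
weights[is_instruction] = 0.5

loss = (per_token_loss * weights).sum() / weights.sum()
```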

Regarding your 2nd point, I would also say yes. Both are valid ideas, but it would require some empirical experiments to see how they work out in practice.

Jun 4 · Liked by Sebastian Raschka, PhD

Hi, great post, as always!

You mention that the authors of the instruction tuning paper do not mask the template; however, judging from this fragment of the paper, I am not sure that is true: "The loss function, L, for instruction modelling calculates the negative log-likelihood for both instruction and completion tokens, excluding any prompt template tokens."

Am I missing something?

author

Thanks for the comment. Yes, that's correct. I mentioned this here: "It's the method that the authors refer to in the paper as 'instruction modeling.' (In the paper, they additionally mask special prompt tokens like <|user|>, <|assistant|>, and <|system|> that may occur in non-Alpaca prompt templates.)"

I.e., they don't mask anything except the special tokens.
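For illustration, masking just those special tokens could be done along these lines (the token IDs below are made up; the key detail is that PyTorch's cross entropy ignores target positions set to -100 by default):

```python
import torch
import torch.nn.functional as F

# Made-up target token IDs; suppose 50257 and 50258 stand in for
# special prompt tokens like <|user|> and <|assistant|>
targets = torch.tensor([50257, 318, 257, 1332, 50258, 2882])
special_ids = torch.tensor([50257, 50258])

# Positions holding special tokens get the value -100, which PyTorch's
# cross entropy skips by default (ignore_index=-100)
masked_targets = targets.clone()
masked_targets[torch.isin(targets, special_ids)] = -100

logits = torch.randn(6, 50_304)  # dummy model outputs (dummy vocab size)
loss = F.cross_entropy(logits, masked_targets)  # special tokens excluded
```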

Jun 2 · Liked by Sebastian Raschka, PhD

Thanks for the wonderful read, Sebastian.

Do you have any thoughts on whether including the instruction in the loss function makes the task 'continued pretraining' rather than 'instruction finetuning'? I mean, how does this differ from continued pretraining then?

author

Good question, but it would still be considered instruction finetuning because you calculate the loss over the inputs differently. In continued pretraining, you feed the LLM more text chunks, whereas in instruction finetuning, the next-token prediction covers the whole training example at once. I will explain this in more detail with a hands-on example in the upcoming chapter 7 of my Build an LLM from Scratch book.

Jun 2 · Liked by Sebastian Raschka, PhD

Alright, I've got your book and will be looking forward to it. Just to check that I understand you correctly: you're saying that if the task were considered continued pretraining, each token in the instruction and response would be generated autoregressively and the loss would be computed for each token. The main difference is that here the loss is computed on the entire instruction and response (treating all of that as a sort of single next-token task), right?

author

Yes, for a text example with n tokens, you compute the loss over n-1 pretraining tasks. In instruction finetuning, you compute the loss over only 1 task in this case. (Of course, you can have multiple training examples, but within 1 epoch, the model sees each token exactly once in instruction finetuning.)
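As a toy illustration of that counting argument (the token IDs are made up):

```python
# A text example with n = 6 tokens
tokens = [5, 12, 7, 3, 9, 2]

# Pretraining view: every prefix is its own next-token prediction task,
# giving n - 1 = 5 (input, target) pairs
pretraining_tasks = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Instruction finetuning view: the whole example is a single training
# instance; inputs and shifted targets are processed in one pass
inputs, targets = tokens[:-1], tokens[1:]
```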


Alright, got you 👍
