11 Comments
Sep 27, 2023 · Liked by Sebastian Raschka, PhD

Thanks for the great post, Sebastian. Please correct me if I am wrong: I think that in the prefix tuning and p-tuning methods, the LLM actually learns the prompts itself for the input and output (we can still write a custom input, though, like "summarize: <context para>"). I have seen some examples for the Hugging Face PEFT library where they only pass the type of PEFT method (p-tuning, prefix tuning, etc.) to the library without actually prefixing the input from the dataset: https://huggingface.co/docs/peft/task_guides/seq2seq-prefix-tuning/

Please share your views regarding that.

Thanks & Regards

author

Yes, that's correct @Nav. In the case of prefix tuning, it's a differentiable tensor that is learned via backpropagation.
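
To make that concrete, here is a minimal PyTorch sketch (the dimensions are illustrative placeholders): the prefix is an ordinary trainable tensor, and only the prefix, not the frozen LLM weights, receives gradient updates.

```python
import torch

# Illustrative dimensions, not from any particular model
prefix_length, embed_dim = 10, 256

# The prefix is just a differentiable tensor, initialized randomly
# and updated via backpropagation like any other model parameter.
prefix = torch.nn.Parameter(torch.randn(prefix_length, embed_dim))

# During finetuning, only the prefix is passed to the optimizer,
# so the pretrained LLM weights stay frozen.
optimizer = torch.optim.Adam([prefix], lr=1e-4)
```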

May 9, 2023 · Liked by Sebastian Raschka, PhD

Thank you for a great explanation. I think the "intuition" behind some LLM finetuning methods is applicable to non-LLM models (e.g., ResNet). Do you think we can use these techniques to build better finetuned non-LLM models?

author

Good question! "Regular" finetuning is certainly applicable to non-LLM models like ResNet; classic transfer learning would be one example. Unsupervised pretraining (i.e., self-supervised learning) followed by supervised training on a smaller labeled target dataset would be another.

Beyond that, however, you could also apply several parameter-efficient finetuning techniques to non-LLMs. The most intuitive candidate would be LoRA: https://arxiv.org/abs/2106.09685 (the paper introduces it for LLMs). I haven't checked the code in detail yet, but there is an application to diffusion models here: https://github.com/cloneofsimo/lora
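
For illustration, here is a minimal sketch of the LoRA idea in PyTorch, following the formulation in the paper (a frozen linear layer plus a trainable low-rank update; the rank and scaling values are arbitrary placeholders). Nothing here is LLM-specific, which is why it transfers to, say, a ResNet's linear layers:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A and B are the only trained weights."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is initialized with small random values, B with zeros,
        # so the low-rank update starts out as a no-op.
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

# Works for any nn.Linear, whether it sits in a transformer block
# or a non-LLM architecture:
layer = LoRALinear(nn.Linear(512, 512), rank=4)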


I have a question about the prefix-tuning paper: is the prefix embedding added to the output of the attention layer, or before the attention layer as in your illustration here? Because if it's before the attention calculation, there are q, k, and v vectors for each token; if you then concatenate the embedding to both the k and v vectors, it will modify the dimension of the k vector, making it incompatible with the q vector in the attention calculation. Is that correct?

So is it correct that the prefix embedding, after going through an MLP, is concatenated along the sequence dimension to the output h_i of the attention layer, which has the same dimension as the input token, so that the original W_o matrix is still compatible?

And speaking of the MLP, is it true that each layer needs its own MLP to produce the prefix embedding?

author

As far as I understand and remember, it's before the attention layer, i.e., you modify the input embeddings.

I think you are worried that the input becomes longer so that the dimensions for the matmul no longer match. That's not the case, though, because the input size stays fixed.

E.g., say you have an LLM with a block size (context size) of 6 tokens (1, 2, 3, 4, 5, 6), and imagine you use 256-dimensional embeddings. In this case, the input tensor is 6x256-dimensional.

Now, in prefix tuning, you would first remove, e.g., the last two tokens from the right (which are usually padding tokens during inference anyway) so that you are left with 1, 2, 3, 4. Then you prepend the prefix tokens: P1, P2, 1, 2, 3, 4. So you don't run into trouble with the matmul dimensions.
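
To make the shapes concrete, here is a small PyTorch sketch of the toy example above (2 prefix tokens, block size 6, 256-dimensional embeddings):

```python
import torch

block_size, embed_dim, num_prefix = 6, 256, 2

# Embedded input for the full context: shape (6, 256)
x = torch.randn(block_size, embed_dim)

# Trainable prefix embeddings P1, P2: shape (2, 256)
prefix = torch.nn.Parameter(torch.randn(num_prefix, embed_dim))

# Drop the last two (padding) positions, then prepend the prefix.
x_prefixed = torch.cat([prefix, x[:block_size - num_prefix]], dim=0)

print(x_prefixed.shape)  # torch.Size([6, 256]) -- same as before
```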


I'm not sure I understand prompt tuning. Do you recommend any resources that I can use to read about it?

author

You mean soft prompt tuning? Unfortunately, I am not aware of any additional resources besides the paper. But let me know if you have any specific questions, and I'd be happy to answer if I can!

Jul 4, 2023 · Liked by Sebastian Raschka, PhD

I may be wrong, but isn't the pseudocode you gave for soft prompt tuning slightly misleading? It omits the training that is necessary to arrive at a good value for the tunable part of the prompt. Instead, it seems to suggest that you directly query the model with the randomly initialized start value for the tunable prompt part, which clearly won't work.

author

Ah, I think I see what you mean. Yes, after initializing the soft prompt, you need to train it. I just added an additional line to indicate that. Thanks for pointing that out!
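
For what it's worth, here is a self-contained PyTorch sketch of both steps (initialize, then train); the dummy model, random data, and hyperparameters are placeholders for illustration, not the article's actual pseudocode:

```python
import torch
import torch.nn as nn

embed_dim, prompt_len, batch, seq_len, vocab = 256, 20, 4, 10, 1000

# Stand-in for a frozen pretrained model (illustrative only).
frozen_model = nn.Linear(embed_dim, vocab)
for p in frozen_model.parameters():
    p.requires_grad = False

# Step 1: initialize the soft prompt (randomly here).
soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim))

# Step 2: train it -- the step the pseudocode originally omitted.
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
for step in range(100):
    input_embeds = torch.randn(batch, seq_len, embed_dim)   # dummy batch
    targets = torch.randint(0, vocab, (batch, seq_len + prompt_len))
    # Prepend the soft prompt along the sequence dimension.
    x = torch.cat([soft_prompt.expand(batch, -1, -1), input_embeds], dim=1)
    loss = nn.functional.cross_entropy(
        frozen_model(x).flatten(0, 1), targets.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # only soft_prompt is updated
```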


I think I'll have to read more about it. There's just so much LLM stuff going on that I haven't been able to keep up with it all.
