This article will teach you about self-attention mechanisms used in transformer architectures and large language models (LLMs) such as GPT-4 and Llama.

I believe the upper triangle needs to start with the second diagonal to achieve the same mask as yours, i.e., `mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)`

As always, thanks a lot for a great tutorial!

Good catch, the output should stay the same -- the diagonal=1 must have gone missing when copying things over from my local code. Fixed!
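To make the effect of the `diagonal` argument concrete, here is a small sketch comparing both variants of the mask. The value of `block_size` is a hypothetical choice for illustration only.

```python
import torch

block_size = 4  # hypothetical context length, chosen only for illustration

# diagonal=0 (the default): the main diagonal is part of the mask
mask_with_diag = torch.triu(torch.ones(block_size, block_size))

# diagonal=1: the mask starts one position above the main diagonal
mask_above_diag = torch.triu(torch.ones(block_size, block_size), diagonal=1)

print(mask_with_diag)
print(mask_above_diag)

# Extract the main diagonals to compare the two variants
diag_default = torch.diagonal(mask_with_diag)
diag_shifted = torch.diagonal(mask_above_diag)
```

With `diagonal=1`, the main diagonal stays unmasked, so each token can still attend to itself.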

keys = embedded_sentence @ W_query

values = embedded_sentence @ W_value

print("keys.shape:", keys.shape)

print("values.shape:", values.shape)

did you mean:

keys = embedded_sentence @ W_query.T

values = embedded_sentence @ W_value.T

Thanks for the note. Should have been `keys = embedded_sentence @ W_key` instead of `keys = embedded_sentence @ W_query`. Just updated it.

`keys = embedded_sentence @ W_key` should be `@ W_key.T`? Otherwise, the dimensions don't match.

Oh, I see what you mean. Thanks. `keys = embedded_sentence @ W_key` should be correct -- I rearranged the weight matrix init a bit and I think it didn't take the update previously. I.e., the weight matrices were supposed to be initialized as

torch.nn.Parameter(torch.rand(d, d_q))

rather than

torch.nn.Parameter(torch.rand(d_q, d))

to avoid the transpose in several places, which makes everything a bit more straightforward. I refreshed it and it seems to be fine now.

That's the last typo; I was able to 'QA' the rest of the code in a Jupyter notebook.

Thank you for the great article
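The shape discussion in this thread can be checked with a short sketch. The sizes `num_tokens`, `d`, and `d_q` and the name `W_key_alt` are assumptions made for illustration; they are not taken from the article.

```python
import torch

torch.manual_seed(123)

num_tokens, d, d_q = 6, 16, 24  # hypothetical sizes for illustration

embedded_sentence = torch.rand(num_tokens, d)

# Shape (d, d_q): the input can be right-multiplied directly
W_key = torch.nn.Parameter(torch.rand(d, d_q))
keys = embedded_sentence @ W_key

# Shape (d_q, d): the same projection needs a transpose
W_key_alt = torch.nn.Parameter(torch.rand(d_q, d))
keys_alt = embedded_sentence @ W_key_alt.T

print("keys.shape:", keys.shape)
print("keys_alt.shape:", keys_alt.shape)
```

Both variants produce a `(num_tokens, d_q)` matrix; initializing the weights as `(d, d_q)` simply avoids the transpose.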

Hi. I have two stupid questions (just to assure myself that I understood everything correctly), and I would be very grateful if you could answer them.

1) In the main block of self-attention code, "x" is the input matrix, so each row represents the embedding of one token? (That is why the number of rows is 512 for this matrix, because of the token limit of BERT-based networks.)

2) In multi-head attention, each head has its own W matrices, yes? Now, to guarantee that the concatenated output dimension equals 768 (here, I assume the embedding dimension of the x matrix is 768), do we reduce the embedding dimension of the input matrices in the first place (when computing the K and Q vectors)? Or do we adjust the "d_v" dimension of the W_v matrix such that d_v = dimension of hidden state (here: 768) / number of heads?

Many thanks in advance

Hey there

Regarding 1), you are correct. The 512 limit you mentioned is the number of rows, i.e., how many tokens the model supports. Note that in practice there is typically also a batch dimension, which is the number of samples. So the input becomes a 3D tensor with dimensions [batch_size, num_tokens, embedding_dim]. I have more advanced implementations of multi-head attention here if you are interested: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb

2) Yes, each has its own matrix. The reduction depends on how you implement it. However, as you point out, a common variant is `head_dim = embedding_dim // num_heads` (see also the link above for the implementations).

In any case, you are on the right track here :)
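As a rough illustration of the shapes discussed in this thread, the sketch below uses the 512-token, 768-dimensional setting from the question. `num_heads = 12` and `batch_size = 2` are assumed values, and the single large projection split across heads is one common way to get per-head weights, not necessarily how the article or the linked notebook implements it.

```python
import torch

torch.manual_seed(123)

batch_size, num_tokens, embedding_dim = 2, 512, 768  # 512 and 768 from the question
num_heads = 12  # assumed value; embedding_dim must be divisible by it

head_dim = embedding_dim // num_heads

# Input with a batch dimension: [batch_size, num_tokens, embedding_dim]
x = torch.rand(batch_size, num_tokens, embedding_dim)

# One projection for all heads; each slice of size head_dim acts as one head's weights
W_query = torch.nn.Linear(embedding_dim, embedding_dim, bias=False)
queries = W_query(x)

# Split the last dimension into (num_heads, head_dim) and move heads forward
queries = queries.view(batch_size, num_tokens, num_heads, head_dim)
queries = queries.transpose(1, 2)  # [batch_size, num_heads, num_tokens, head_dim]

# Concatenating the heads again restores the original embedding dimension
merged = queries.transpose(1, 2).contiguous().view(batch_size, num_tokens, embedding_dim)

print("head_dim:", head_dim)
print("queries.shape:", queries.shape)
print("merged.shape:", merged.shape)
```

Keys and values would be split the same way, so the concatenated head outputs add up to `embedding_dim` again.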

I want to say: SO CLEAR! SO GREAT! Why did I find this so late? Thank you, Dr. Sebastian Raschka.

Thanks for the kind words!

> Note that in cross-attention, the two input sequences x_1 and x_2 can have different numbers of elements. However, their embedding dimensions must match.

This is not obvious to me. All formulae in this post seem to work out without needing this. Could you clarify?

Hi there,

Yes, that's a good point. It is sufficient for the embedding dimensions of Q and K to match; the embedding dimensions of x_1 and x_2 themselves don't have to match.

Right, yes exactly. Thank you so much for your reply and apologies for my late acknowledgement!
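The point made in this thread can be verified with a minimal cross-attention sketch in which the two inputs differ both in length and in embedding size. All sizes and the input dimension names `d_in_1` and `d_in_2` are hypothetical.

```python
import torch

torch.manual_seed(123)

# Hypothetical sizes: the two sequences differ in length and in embedding size
n_1, d_in_1 = 5, 16
n_2, d_in_2 = 8, 32
d_kq, d_v = 24, 28

x_1 = torch.rand(n_1, d_in_1)
x_2 = torch.rand(n_2, d_in_2)

# Queries come from x_1; keys and values come from x_2
W_query = torch.rand(d_in_1, d_kq)
W_key = torch.rand(d_in_2, d_kq)
W_value = torch.rand(d_in_2, d_v)

queries = x_1 @ W_query  # (n_1, d_kq)
keys = x_2 @ W_key       # (n_2, d_kq)
values = x_2 @ W_value   # (n_2, d_v)

# Only the projected query/key dimension d_kq has to match here
attn_scores = queries @ keys.T  # (n_1, n_2)
attn_weights = torch.softmax(attn_scores / d_kq**0.5, dim=-1)
context_vec = attn_weights @ values  # (n_1, d_v)

print("context_vec.shape:", context_vec.shape)
```

The projection matrices absorb the difference between `d_in_1` and `d_in_2`, which is why only the query and key dimensions need to agree.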

Great article - really well written and clear! Also really enjoyed some of the other articles like DoRA.

I am currently writing a Medium article and would like to link this article as a great follow up resource. Would you mind if I used one of your great self-attention visualisations (with proper attribution)?

Glad you found it useful! And yes, I am happy to give permission to reshare the figures given that they include a notice about the source below each figure. For instance: "Image Source: Sebastian Raschka, https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention"

Great thank you! Cited it as you recommended https://medium.com/p/bb9b071e2238#b0ab-eedf54532079

Very Nice instruction!

Thanks!

Great Article. I was looking to broaden my understanding on self attention and I'm glad I stumbled on this.

Thanks a lot for this article! It cleared up many concepts that I had been struggling with for a long time.

"The masked_fill method then replaces all the elements on and above the diagonal via positive mask values (1s) with -torch.inf, with the results being shown below."

Since you added the argument, it should be *above*, not *on and above*

Thank you for the note!

Absolutely right. Good catch. I removed the "on".
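A short sketch of the masking step discussed in this thread, assuming a random 4x4 score matrix as a stand-in for real attention scores:

```python
import torch

torch.manual_seed(123)

block_size = 4  # hypothetical context length
attn_scores = torch.rand(block_size, block_size)  # placeholder scores

# 1s strictly above the main diagonal mark the positions to hide
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)

# Replace the masked positions (above the diagonal) with -inf
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)

# softmax maps -inf to 0, so future positions receive zero weight
attn_weights = torch.softmax(masked, dim=-1)

print(masked)
print(attn_weights)
```

Because the diagonal itself is not replaced, every row keeps at least one finite score and still normalizes to 1.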

I was wondering how backpropagation works with an encoder-decoder transformer? For example, is it necessary to define two optimisers, one for the encoder and one for the decoder? What are the target values of the encoder network when the parameters need to be trained? To summarise: Would backpropagation for training the transformer also have to be performed for the encoder network or only for the decoder network, and if so, how?

Good question. In contrast to a GAN, you only need one optimizer. The reason is that the encoder and decoder are both part of the same architecture, i.e., the encoder output feeds into the decoder. Instead of thinking of them as two independent neural networks, it perhaps helps to think of them just as layers.

In PyTorch, the overall structure could look as follows:

EDIT: Code formatting doesn't seem to be supported in the comments, so let me add a link: https://gist.github.com/rasbt/4c32fac33a6641b1fb608718e2a51500
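For readers who prefer an inline illustration, here is a minimal sketch of the single-optimizer idea. It is not the content of the linked gist; the class name `TinyEncoderDecoder`, all sizes, and the mean-squared-error loss on random targets are assumptions made purely to show that one backward pass reaches both parts of the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)


class TinyEncoderDecoder(nn.Module):
    # Hypothetical toy model: one encoder layer and one decoder layer
    def __init__(self, d_model=16):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model, nhead=2, dim_feedforward=32, batch_first=True
        )
        self.decoder = nn.TransformerDecoderLayer(
            d_model, nhead=2, dim_feedforward=32, batch_first=True
        )
        self.out = nn.Linear(d_model, d_model)

    def forward(self, src, tgt):
        memory = self.encoder(src)        # encoder output ...
        hidden = self.decoder(tgt, memory)  # ... feeds into the decoder
        return self.out(hidden)


model = TinyEncoderDecoder()

# A single optimizer covers encoder and decoder parameters alike
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

src = torch.rand(2, 5, 16)     # [batch_size, src_tokens, d_model]
tgt = torch.rand(2, 7, 16)     # [batch_size, tgt_tokens, d_model]
target = torch.rand(2, 7, 16)  # placeholder targets for the decoder output

loss = nn.functional.mse_loss(model(src, tgt), target)

optimizer.zero_grad()
loss.backward()  # gradients flow through the decoder back into the encoder
optimizer.step()

encoder_has_grads = all(p.grad is not None for p in model.encoder.parameters())
print("encoder received gradients:", encoder_has_grads)
```

The loss is defined only on the decoder output; the encoder needs no separate targets because its parameters receive gradients through the decoder.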

Well explained!

Well-written article. The concept of queries, keys, and values is finally clear to me.

In the diagram "Computing the normalized attention weights α", did you mean

α_2 = softmax(omega_2 / sqrt(d_k))

instead of

α_{2,i} = softmax(omega_{2,i} / sqrt(d_k))?

Since omega_{2,i} / sqrt(d_k) is a scalar whereas softmax operates on vectors.

Good call, the "i" shouldn't be there. Just updated it. Thanks!
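Written out, the distinction raised in this thread looks as follows. The symbol T for the number of input tokens is an assumption introduced here; the other symbols follow the comment above.

```latex
\alpha_2 = \operatorname{softmax}\!\left(\frac{\omega_2}{\sqrt{d_k}}\right),
\qquad
\alpha_{2,i} = \frac{\exp\!\left(\omega_{2,i} / \sqrt{d_k}\right)}
                    {\sum_{j=1}^{T} \exp\!\left(\omega_{2,j} / \sqrt{d_k}\right)}
```

The softmax is applied to the whole vector \omega_2; a single weight \alpha_{2,i} is one component of the result, and it depends on all T scores through the normalizing sum.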

There is one more minor typo, a duplicated capital "I" that needs to be removed in the text:

"IIn this article, ..." -> "In this article, ..."

Hi Sebastian,

Thanks for the great article.

I've implemented the original Transformer architecture a couple of times already and every time I learn something new.

Just a quick note:

I think mask = torch.triu(torch.ones(block_size, block_size)) should be mask = torch.triu(torch.ones(block_size, block_size), diagonal=1), otherwise the values on the diagonal get masked as well.

This has been fixed; someone else reported that issue as well (see above in the comments).

Hi. Thank you for your great article.

I think there is a typo in the last (and the most general) code of self attention.

Where you wanted to scale the dot products, you divided the unscaled matrix by d_kq instead of d_v.

Am I right?

Thanks for the comment. As far as I know (e.g., based on the original transformer paper, "Attention Is All You Need", https://arxiv.org/pdf/1706.03762), the division by `d_kq` should be correct. I.e., see Eq (1) on page 4 in the paper.

I am curious why you think otherwise (did I have a misleading sentence elsewhere in the text)?

No, sorry, it was my mistake!

Actually, in the text you mentioned "dk", but in the code you wrote "dkq"; since the latter has two letters, I thought it was wrong! (I know that dk = dq. At the time, I assumed the text only used "dv" and "dkq"; I forgot that you also wrote "dk" and "dq" separately.)
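To illustrate which dimension enters the scaling, here is a minimal self-attention sketch following Eq. (1) of the paper cited above, where the scores are divided by the square root of the query/key dimension. All sizes are hypothetical, and the variable names only loosely mirror those used in the thread.

```python
import torch

torch.manual_seed(123)

num_tokens, d_in = 6, 16  # hypothetical sizes
d_kq, d_v = 24, 28        # query/key size and value size may differ

x = torch.rand(num_tokens, d_in)
W_query = torch.rand(d_in, d_kq)
W_key = torch.rand(d_in, d_kq)
W_value = torch.rand(d_in, d_v)

queries = x @ W_query
keys = x @ W_key
values = x @ W_value

attn_scores = queries @ keys.T  # dot products over d_kq entries

# The scale factor is sqrt(d_kq) because the dot products sum over d_kq terms
attn_weights = torch.softmax(attn_scores / d_kq**0.5, dim=-1)

# d_v only determines the size of the output vectors
context_vec = attn_weights @ values

print("attn_weights.shape:", attn_weights.shape)
print("context_vec.shape:", context_vec.shape)
```

The value dimension `d_v` never appears in the dot products, so it plays no role in the scaling.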