40 Comments

I believe the upper triangle needs to start with the second diagonal to achieve the same mask as yours, i.e., `mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)`

As always, thanks a lot for a great tutorial!

Good catch, the output should stay the same -- the diagonal=1 must have gone missing when copying things over from my local code. Fixed!
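For anyone reading along, here is a minimal sketch of what the `diagonal=1` argument changes (using a placeholder `block_size`):

```python
import torch

block_size = 4  # placeholder number of tokens

# diagonal=1 keeps the main diagonal unmasked, so each token can still attend to itself.
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])

# Without diagonal=1, torch.triu would also place 1s on the diagonal,
# which would mask out each token's attention to itself.
```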

keys = embedded_sentence @ W_query

values = embedded_sentence @ W_value

print("keys.shape:", keys.shape)

print("values.shape:", values.shape)

did you mean:

keys = embedded_sentence @ W_query.T

values = embedded_sentence @ W_value.T

Thanks for the note. Should have been `keys = embedded_sentence @ W_key` instead of `keys = embedded_sentence @ W_query`. Just updated it.

`keys = embedded_sentence @ W_keys` should be `@ W_key.T`? Otherwise, the dimensions don't match.

Oh, I see what you mean. Thanks. `keys = embedded_sentence @ W_key` should be correct -- I rearranged the weight matrix init a bit, and I think the update didn't go through previously. I.e., the weight matrices were supposed to be initialized as

torch.nn.Parameter(torch.rand(d, d_q))

rather than

torch.nn.Parameter(torch.rand(d_q, d))

to avoid the transpose in several places, which makes everything a bit more straightforward. I refreshed it and it seems to be fine now.
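To spell out the difference between the two conventions, here is a minimal sketch (the dimensions below are placeholders, not necessarily the ones used in the article):

```python
import torch

torch.manual_seed(123)
d, d_q, num_tokens = 6, 4, 3  # placeholder dimensions
embedded_sentence = torch.rand(num_tokens, d)

# Convention with shape (d, d_q): no transpose needed in the matmul.
W_key = torch.nn.Parameter(torch.rand(d, d_q))
keys = embedded_sentence @ W_key            # (num_tokens, d_q)

# Convention with shape (d_q, d): requires the transpose.
W_key_alt = torch.nn.Parameter(torch.rand(d_q, d))
keys_alt = embedded_sentence @ W_key_alt.T  # (num_tokens, d_q)

print(keys.shape, keys_alt.shape)  # torch.Size([3, 4]) torch.Size([3, 4])
```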

That's the last typo; I was able to 'QA' the rest of the code in a Jupyter notebook.

Thank you for the great article

Hi. I have two (possibly stupid) questions, just to assure myself that I understood everything correctly, and I would be very grateful if you could answer them.

1) In the main block of self-attention code, "x" is the input matrix, so each row represents the embedding of one token? (That's why the number of rows is 512 for this matrix, because of the token limit of BERT-based networks.)

2) In multi-head attention, each head has its own W matrices, yes? Now, to guarantee that the sum of the output dimensions (after concatenation) equals 768 (here, I assume the embedding dimension of the x matrix is 768), do we reduce the embedding dimension of the input matrices in the first place (when computing the K and Q vectors)? Or do we adjust the d_v dimension of the value weight matrix such that d_v = hidden state dimension (here: 768) / number of heads?

Many thanks in advance

Hey there

regarding 1), you are correct. And the 512 limitation you mentioned here would be the number of rows, or how many tokens the model supports. Note that in practice there is typically also a batch dimension added, which is for the number of samples. So it becomes a 3D tensor with dimensions: [batch_size, num_tokens, embedding_dim]. I have more advanced implementations of multihead attention here if you are interested: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb

2) Yes, each head has its own matrices. The reduction depends on how you implement it. However, like you point out, a common variant is `head_dim = embedding_dim // num_heads` (see also the link above for the implementations, and the short sketch below).

In any case, you are on the right track here :)
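To make the `head_dim = embedding_dim // num_heads` point concrete, here is a minimal sketch (placeholder sizes, and not the exact implementation from the linked notebook):

```python
import torch
import torch.nn as nn

embedding_dim, num_heads, num_tokens = 768, 12, 512  # placeholder sizes
head_dim = embedding_dim // num_heads                # 64 dimensions per head

x = torch.rand(num_tokens, embedding_dim)

# Each head projects the full 768-dim input down to head_dim=64 ...
W_value_per_head = [nn.Parameter(torch.rand(embedding_dim, head_dim))
                    for _ in range(num_heads)]
values_per_head = [x @ W_v for W_v in W_value_per_head]  # each (512, 64)

# ... and concatenating the heads restores the original embedding dimension.
values_concat = torch.cat(values_per_head, dim=-1)
print(values_concat.shape)  # torch.Size([512, 768])
```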

I want to say: SO CLEAR! SO GREAT! Why did I find this so late? Thank you, Dr. Sebastian Raschka.

Thanks for the kind words!

> Note that in cross-attention, the two input sequences x_1 and x_2 can have different numbers of elements. However, their embedding dimensions must match.

This is not obvious to me. All formulae in this post seem to work out without needing this. Could you clarify?

Hi there,

Yes, that's a good point. It would be sufficient if the embedding dimensions of Q and K match; the embedding dimensions of x_1 and x_2 themselves don't have to match.
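A minimal sketch of that point (placeholder dimensions; x_1 and x_2 here differ in both length and embedding dimension):

```python
import torch

torch.manual_seed(123)
n_1, d_1 = 5, 16   # e.g., decoder-side sequence: 5 tokens, 16-dim embeddings
n_2, d_2 = 8, 32   # e.g., encoder-side sequence: 8 tokens, 32-dim embeddings
d_qk, d_v = 4, 6   # shared query/key dimension and value dimension

x_1 = torch.rand(n_1, d_1)
x_2 = torch.rand(n_2, d_2)

W_query = torch.rand(d_1, d_qk)  # maps x_1 into the query space
W_key = torch.rand(d_2, d_qk)    # maps x_2 into the *same* key space
W_value = torch.rand(d_2, d_v)

queries = x_1 @ W_query          # (5, 4)
keys = x_2 @ W_key               # (8, 4)
values = x_2 @ W_value           # (8, 6)

# Only the query and key dimensions need to match for the dot product.
attn_scores = queries @ keys.T                               # (5, 8)
attn_weights = torch.softmax(attn_scores / d_qk**0.5, dim=-1)
context = attn_weights @ values                              # (5, 6)
print(context.shape)
```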

Right, yes exactly. Thank you so much for your reply and apologies for my late acknowledgement!

Great article - really well written and clear! Also really enjoyed some of the other articles like DoRA.

I am currently writing a Medium article and would like to link this article as a great follow up resource. Would you mind if I used one of your great self-attention visualisations (with proper attribution)?

Glad you found it useful! And yes, I am happy to give permission to reshare the figures given that they include a notice about the source below each figure. For instance: "Image Source: Sebastian Raschka, https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention"

Great, thank you! Cited it as you recommended: https://medium.com/p/bb9b071e2238#b0ab-eedf54532079

Very nice instruction!

Thanks!

Great article. I was looking to broaden my understanding of self-attention, and I'm glad I stumbled on this.

Thanks a lot for this article! It cleared up many concepts that I had been struggling with for a long time.

"The masked_fill method then replaces all the elements on and above the diagonal via positive mask values (1s) with -torch.inf, with the results being shown below."

Since you added the argument, it should be *above*, not *on and above*

Thank you for the note!

Absolutely right. Good catch. I removed the "on".
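For anyone double-checking the behavior, a minimal sketch of the masking step (placeholder scores, not the article's exact tensors):

```python
import torch

block_size = 4
attn_scores = torch.rand(block_size, block_size)  # placeholder attention scores

# 1s strictly above the diagonal mark the future positions to be masked.
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)

# masked_fill replaces only the elements above the diagonal with -inf,
# so those positions get zero weight after the softmax.
masked_scores = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked_scores, dim=-1)
print(attn_weights)
```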

I was wondering how backpropagation works with an encoder-decoder transformer. For example, is it necessary to define two optimisers, one for the encoder and one for the decoder? What are the target values of the encoder network when the parameters need to be trained? To summarise: would backpropagation for training the transformer also have to be performed for the encoder network or only for the decoder network, and if so, how?

Good question. In contrast to a GAN, you actually only need one optimizer. The reason is that the encoder and decoder are both part of the same architecture. I.e., the encoder output feeds into the decoder. Instead of thinking of it as two different independent neural networks, it perhaps helps to think of them just as layers.

In PyTorch, the overall structure could look as follows:

EDIT: Code formatting doesn't seem to be supported in the comments, so let me add a link: https://gist.github.com/rasbt/4c32fac33a6641b1fb608718e2a51500
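For readers who don't want to click through, a rough skeleton of the idea (hypothetical layer choices and sizes, not the exact code from the gist) might look like this:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    # Hypothetical skeleton; a real model would add embeddings, positional
    # encodings, masks, and so on.
    def __init__(self, d_model=512, vocab_size=10_000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_emb, tgt_emb):
        memory = self.encoder(src_emb)           # encoder output ...
        hidden = self.decoder(tgt_emb, memory)   # ... feeds into the decoder
        return self.out(hidden)

model = EncoderDecoder()

# A single optimizer covers all parameters; the loss on the decoder's output
# backpropagates through the decoder *and* the encoder, so the encoder needs
# no separate targets or optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```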

Well explained!

Well-written article. The concept of queries, keys, and values is finally clear to me.

In the diagram "Computing the normalized attention weights α", did you mean

α_2 = softmax(omega_2 / sqrt(d_k))

instead of

α_{2,i} = softmax(omega_{2,i} / sqrt(d_k))?

Since omega_{2,i} / sqrt(d_k) is a scalar whereas softmax operates on vectors.

Good call, the "i" shouldn't be there. Just updated it. Thanks!
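To make the corrected notation concrete, a minimal sketch (placeholder values):

```python
import torch

d_k = 4
omega_2 = torch.rand(6)  # placeholder: unnormalized attention scores for token 2

# softmax is applied to the whole vector of scaled scores,
# yielding the normalized attention weights alpha_2.
alpha_2 = torch.softmax(omega_2 / d_k**0.5, dim=-1)
print(alpha_2.sum())  # the weights sum to 1
```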

There is one more minor typo, a duplicated capital "I" that needs to be removed in the text:

"IIn this article, ..." -> "In this article, ..."

Hi Sebastian,

Thanks for the great article.

I've implemented the original Transformer architecture a couple of times already and every time I learn something new.

Just a quick note:

I think mask = torch.triu(torch.ones(block_size, block_size)) should be mask = torch.triu(torch.ones(block_size, block_size), diagonal=1), otherwise the values on the diagonal get masked as well.

This has been fixed; someone else reported that issue as well (see above in the comments).

Hi. Thank you for your great article.

I think there is a typo in the last (and most general) self-attention code example.

Where you wanted to scale the dot products, you divided the unscaled matrix by d_kq instead of d_v.

Am I right?

Thanks for the comment. As far as I know (e.g., based on the original transformer paper, "Attention Is All You Need", https://arxiv.org/pdf/1706.03762), the division by `d_kq` should be correct. I.e., see Eq (1) on page 4 in the paper.
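For reference, a minimal sketch of the scaled dot-product attention from Eq. (1) of the paper (placeholder dimensions):

```python
import torch

torch.manual_seed(123)
num_tokens, d_kq, d_v = 3, 4, 6  # placeholder dimensions

queries = torch.rand(num_tokens, d_kq)
keys = torch.rand(num_tokens, d_kq)
values = torch.rand(num_tokens, d_v)

# softmax(Q K^T / sqrt(d_k)) V -- the scaling factor comes from the
# query/key dimension (d_kq here), not from the value dimension d_v.
attn_weights = torch.softmax(queries @ keys.T / d_kq**0.5, dim=-1)
context = attn_weights @ values  # (num_tokens, d_v)
print(context.shape)
```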

I am curious why you think otherwise (did I have a misleading sentence elsewhere in the text)?

No, sorry, it was my mistake!

Actually, in the text you mentioned "dk" ... but inside the code you wrote "dkq"; since the latter has two letters, I thought it was wrong! (I know that dk = dq. At the time, my inference was that in the text you had written only "dv" and "dkq"; I forgot that you also wrote "dk" and "dq" separately.)
