27 Comments
Jan 14 · Liked by Sebastian Raschka, PhD

I believe the upper triangular mask needs to start with the second diagonal to achieve the same mask as yours, i.e., `mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)`

As always, thanks a lot for a great tutorial!

author

Good catch, the output should stay the same -- the diagonal=1 must have gone missing when copying things over from my local code. Fixed!
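For reference, a minimal sketch of the corrected mask (block_size is just a small illustrative value here):

```
import torch

block_size = 4  # illustrative sequence length

# 1s strictly above the main diagonal mark the future positions to be masked
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])
```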

Jan 14 · Liked by Sebastian Raschka, PhD

```
keys = embedded_sentence @ W_query
values = embedded_sentence @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
```

did you mean:

```
keys = embedded_sentence @ W_query.T
values = embedded_sentence @ W_value.T
```

author
Jan 14 · edited Jan 14 · Author

Thanks for the note. Should have been `keys = embedded_sentence @ W_key` instead of `keys = embedded_sentence @ W_query`. Just updated it.

Jan 14 · Liked by Sebastian Raschka, PhD

`keys = embedded_sentence @ W_key` should be `@ W_key.T`? Otherwise, the dimensions don't match

author

Oh, I see what you mean. Thanks. `keys = embedded_sentence @ W_key` should be correct -- I rearranged the weight matrix initialization a bit, and I think the update didn't go through previously. I.e., the weight matrices were supposed to be initialized as

`torch.nn.Parameter(torch.rand(d, d_q))`

rather than

`torch.nn.Parameter(torch.rand(d_q, d))`

to avoid the transpose in several places, which makes everything a bit more straightforward. I refreshed it and it seems to be fine now.
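As a quick shape sanity check, here is a minimal sketch of that setup (the dimensions below are illustrative, not necessarily the ones used in the article):

```
import torch

torch.manual_seed(123)

d = 16                 # embedding dimension (illustrative)
d_k = d_v = 8          # projection dimensions (illustrative)
seq_len = 6

embedded_sentence = torch.rand(seq_len, d)

# with shape (d, d_k), no transpose is needed in the matmul
W_key = torch.nn.Parameter(torch.rand(d, d_k))
W_value = torch.nn.Parameter(torch.rand(d, d_v))

keys = embedded_sentence @ W_key      # (seq_len, d) @ (d, d_k) -> (seq_len, d_k)
values = embedded_sentence @ W_value  # (seq_len, d) @ (d, d_v) -> (seq_len, d_v)

print("keys.shape:", keys.shape)      # torch.Size([6, 8])
print("values.shape:", values.shape)  # torch.Size([6, 8])
```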

That's the last typo; I was able to 'QA' the rest of the code in a Jupyter notebook.

Thank you for the great article.

May 8 · Liked by Sebastian Raschka, PhD

Very nice instruction!

author

Thanks!

Apr 21 · Liked by Sebastian Raschka, PhD

Great article. I was looking to broaden my understanding of self-attention, and I'm glad I stumbled on this.

Apr 2 · Liked by Sebastian Raschka, PhD

Thanks a lot for this article! It cleared up many concepts that I had been struggling with for a long time.

Jan 21 · Liked by Sebastian Raschka, PhD

"The masked_fill method then replaces all the elements on and above the diagonal via positive mask values (1s) with -torch.inf, with the results being shown below."

Since you added the `diagonal=1` argument, it should be *above*, not *on and above*.

Thank you for the note!

author

Absolutely right. Good catch. I removed the "on".
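For reference, a minimal sketch of that masking step with illustrative scores (the variable names are just placeholders):

```
import torch
import torch.nn.functional as F

block_size = 4
attn_scores = torch.rand(block_size, block_size)  # illustrative unnormalized scores

mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)

# only positions strictly above the diagonal (mask == 1) are set to -inf
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = F.softmax(masked, dim=-1)  # rows sum to 1; masked positions get weight 0
```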

Jan 20 · Liked by Sebastian Raschka, PhD

I was wondering how backpropagation works with an encoder-decoder transformer? For example, is it necessary to define two optimisers, one for the encoder and one for the decoder? What are the target values of the encoder network when the parameters need to be trained? To summarise: Would backpropagation for training the transformer also have to be performed for the encoder network or only for the decoder network, and if so, how?

author
Jan 20 · edited Jan 21 · Author

Good question. In contrast to a GAN, you actually only need one optimizer. The reason is that the encoder and decoder are both part of the same architecture. I.e., the encoder output feeds into the decoder. Instead of thinking of it as two different independent neural networks, it perhaps helps to think of them just as layers.

In PyTorch, the overall structure could look as follows:

EDIT: Code formatting doesn't seem to be supported in the comments, so let me add a link: https://gist.github.com/rasbt/4c32fac33a6641b1fb608718e2a51500
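For context, here is a minimal sketch of that idea (illustrative only, not the contents of the linked gist): the encoder and decoder live in one module, a single optimizer covers all parameters, the loss is computed only on the decoder output, and backpropagation flows through both parts.

```
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        # nn.Transformer bundles the encoder and decoder stacks in one module
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # (causal and padding masks omitted for brevity)
        hidden = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(hidden)

model = TinyEncoderDecoder()

# a single optimizer over *all* parameters -- encoder and decoder alike
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

src = torch.randint(0, 100, (8, 10))  # (batch, src_len), random stand-in data
tgt = torch.randint(0, 100, (8, 12))  # (batch, tgt_len)

logits = model(src, tgt[:, :-1])      # predict the next target token at each step
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 100), tgt[:, 1:].reshape(-1)
)
loss.backward()   # gradients flow through the decoder *and* the encoder
optimizer.step()
```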

Jan 18 · Liked by Sebastian Raschka, PhD

Well explained!

Jan 16 · edited Jan 16 · Liked by Sebastian Raschka, PhD

Well-written article. The concept of Queries, Keys, and Values is finally clear to me.

Jan 15 · Liked by Sebastian Raschka, PhD

In the diagram "Computing the normalized attention weights α", did you mean

α_2 = softmax(ω_2 / sqrt(d_k))

instead of

α_{2,i} = softmax(ω_{2,i} / sqrt(d_k))?

Since ω_{2,i} / sqrt(d_k) is a scalar, whereas softmax operates on vectors.

author

Good call, the "i" shouldn't be there. Just updated it. Thanks!
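A quick sketch of the corrected step, with illustrative tensors (the names and sizes here are just placeholders):

```
import torch
import torch.nn.functional as F

d_k = 8
query_2 = torch.rand(d_k)        # query for the 2nd input token (illustrative)
keys = torch.rand(6, d_k)        # keys for all 6 tokens

omega_2 = query_2 @ keys.T       # vector of unnormalized scores, shape (6,)
alpha_2 = F.softmax(omega_2 / d_k**0.5, dim=0)  # softmax over the whole score vector
print(alpha_2.sum())             # tensor(1.)
```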

Jan 15 · edited Jan 15

There is one more minor typo, a duplicated capital I that needs to be removed in the text:

"IIn this article, ..." -> "In this article, ..."

Jan 14 · edited Jan 14

Hi Sebastian,

Thanks for the great article.

I've implemented the original Transformer architecture a couple of times already and every time I learn something new.

Just a quick note:

I think `mask = torch.triu(torch.ones(block_size, block_size))` should be `mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)`; otherwise, the values on the diagonal get masked as well.

This has been fixed; someone else reported that issue as well (see above in the comments).

Jan 15 · edited Jan 17

Does anyone else have problems copying the code? Every time I copy-paste the code into a Jupyter NB, for instance

```
sentence = 'Life is short, eat dessert first'

dc = {s:i for i,s
      in enumerate(sorted(sentence.replace(',', '').split()))}
```

it adds an invalid token ("\u200b") at every empty line in the copied code; in this example, at the empty line between `sentence` and `dc`.

author

Arg, sorry, I can confirm that I'm getting the same thing there too. If you open the Jupyter notebook in VS Code, though, it highlights these invisible invalid characters, and you can do a "Find & Replace All" on them. (I wish Substack let me post a screenshot to demonstrate.)
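If you prefer to clean it up programmatically, a small snippet like the following would strip the zero-width spaces from a pasted string (hypothetical helper, not part of the article):

```
# remove zero-width spaces (U+200B) from a pasted code snippet
pasted = "sentence = 'Life is short, eat dessert first'\n\u200b\ndc = {}"
cleaned = pasted.replace("\u200b", "")
print(cleaned)
```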

Jan 17 · Liked by Sebastian Raschka, PhD

I see it more like a feature than a bug 😁.

It forces me to actually type the code, which gives me a better understanding of the concepts than I would have by simply copy-pasting it.

Yes, I encountered the same issue.
