25 Comments
Jan 14 · Liked by Sebastian Raschka, PhD

I believe the upper triangle needs to start with the second diagonal to achieve the same mask as yours, i.e., mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)

As always, thanks a lot for a great tutorial!
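
For reference, a quick sketch of the mask that the suggested diagonal=1 call produces (block_size = 4 is just an illustrative value):

```

import torch

block_size = 4

# diagonal=1 starts the upper triangle one position above the main diagonal,
# so the diagonal itself stays 0 and each token can still attend to itself
mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)
print(mask)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])

```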

Jan 14 · Liked by Sebastian Raschka, PhD

```

keys = embedded_sentence @ W_query
values = embedded_sentence @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

```

did you mean:

```

keys = embedded_sentence @ W_query.T
values = embedded_sentence @ W_value.T

```
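
A quick shape check may help here. This is only a sketch with placeholder dimensions, assuming the weight matrices are created with shape (d_k, d) and (d_v, d) as in the article, in which case the transpose is what makes the matrix multiplication line up:

```

import torch

torch.manual_seed(123)
n, d, d_k, d_v = 6, 16, 24, 28        # number of tokens, embedding dim, placeholder key/value dims
embedded_sentence = torch.rand(n, d)  # one row per token
W_key = torch.rand(d_k, d)
W_value = torch.rand(d_v, d)

# (n, d) @ (d, d_k) -> (n, d_k); without .T the inner dimensions would not match
keys = embedded_sentence @ W_key.T
values = embedded_sentence @ W_value.T
print("keys.shape:", keys.shape)      # torch.Size([6, 24])
print("values.shape:", values.shape)  # torch.Size([6, 28])

```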

Apr 21 · Liked by Sebastian Raschka, PhD

Great article. I was looking to broaden my understanding of self-attention, and I'm glad I stumbled on this.

Apr 2 · Liked by Sebastian Raschka, PhD

Thanks a lot for this article! It cleared up many concepts that I had been struggling with for a long time.

Jan 21 · Liked by Sebastian Raschka, PhD

"The masked_fill method then replaces all the elements on and above the diagonal via positive mask values (1s) with -torch.inf, with the results being shown below."

Since you added the diagonal=1 argument, it should be *above*, not *on and above*

Thank you for the note!
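
A small sketch of that behavior, using random numbers as a stand-in for the real attention scores:

```

import torch

block_size = 4
attn_scores = torch.rand(block_size, block_size)  # stand-in scores

mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
# with diagonal=1, only the entries strictly above the diagonal
# are replaced with -inf; the diagonal itself is kept

```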

Jan 20 · Liked by Sebastian Raschka, PhD

I was wondering how backpropagation works with an encoder-decoder transformer? For example, is it necessary to define two optimisers, one for the encoder and one for the decoder? What are the target values of the encoder network when the parameters need to be trained? To summarise: Would backpropagation for training the transformer also have to be performed for the encoder network or only for the decoder network, and if so, how?
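
For what it's worth, here is a minimal sketch of the usual setup: the encoder and decoder are part of one model, a single optimizer covers all parameters, and the loss is computed only on the decoder's output. There is no separate target for the encoder; backpropagation flows from the decoder's loss back into the encoder automatically because they form one computation graph. The nn.Transformer module, the toy loss, and the tensor shapes below are only illustrative (a real setup would use token embeddings, an output projection, and a cross-entropy loss on the target tokens):

```

import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # one optimizer for all parameters

src = torch.rand(10, 32, 64)   # (source_len, batch, d_model)
tgt = torch.rand(12, 32, 64)   # (target_len, batch, d_model)

out = model(src, tgt)                     # encoder output feeds the decoder's cross-attention
loss = nn.functional.mse_loss(out, tgt)   # toy loss, just to have a scalar to backpropagate
loss.backward()                           # gradients flow through decoder *and* encoder
optimizer.step()
optimizer.zero_grad()

```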

Jan 18 · Liked by Sebastian Raschka, PhD

Well explained!

Jan 16 · edited Jan 16 · Liked by Sebastian Raschka, PhD

Well written article. The concept of Queries, Keys, and Values is finally clear to me.

Jan 15 · Liked by Sebastian Raschka, PhD

In the diagram "Computing the normalized attention weights α", did you mean

α_2 = softmax(omega_2 / sqrt(d_k))

instead of

α_{2,i} = softmax(omega_{2,i} / sqrt(d_k))?

Since omega_{2,i} / sqrt(d_k) is a scalar whereas softmax operates on vectors.
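
Put differently, a sketch of the vector-valued reading (with ω_2 the full vector of unnormalized scores for query 2, and α_{2,i} one entry of the resulting vector):

```
\alpha_2 = \operatorname{softmax}\!\left(\frac{\boldsymbol{\omega}_2}{\sqrt{d_k}}\right),
\qquad
\alpha_{2,i} = \frac{\exp\!\left(\omega_{2,i}/\sqrt{d_k}\right)}{\sum_{j}\exp\!\left(\omega_{2,j}/\sqrt{d_k}\right)}
```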

Jan 14 · edited Jan 14

Hi Sebastian,

Thanks for the great article.

I've implemented the original Transformer architecture a couple of times already and every time I learn something new.

Just a quick note:

I think mask = torch.triu(torch.ones(block_size, block_size)) should be mask = torch.triu(torch.ones(block_size, block_size), diagonal=1), otherwise the values on the diagonal get masked as well.

Jan 15 · edited Jan 17

Does anyone else have problems copying the code? Every time I copy-paste the code into a Jupyter NB, for instance

```

sentence = 'Life is short, eat dessert first'

dc = {s:i for i,s
      in enumerate(sorted(sentence.replace(',', '').split()))}

```

it adds an invalid token "\u200b" at every empty line in the pasted text, in this example at the empty line between sentence and dc.
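
In case it is useful, a small workaround sketch: stripping the zero-width-space character from the pasted text before running it (the pasted string here is only an example):

```

# '\u200b' is a zero-width space that some web pages insert on empty lines
pasted = "sentence = 'Life is short, eat dessert first'\n\u200b\ndc = {}"
cleaned = pasted.replace('\u200b', '')
print(repr(cleaned))  # the zero-width space is gone; the code can now be executed

```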
