This article will teach you about self-attention mechanisms used in transformer architectures and large language models (LLMs) such as GPT-4 and Llama. Self-attention and related mechanisms are core components of LLMs, making them a useful topic to understand when working with these models.
I believe the upper triangle needs to start with the second diagonal to achieve the same mask as yours, i.e., mask = torch.triu(torch.ones((block_size, block_size)), diagonal=1)
As always, thanks a lot for a great tutorial!
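For readers comparing the two calls, here is a minimal sketch (with an assumed toy block_size of 4) of how the diagonal argument changes the output of torch.triu; only the diagonal=1 version leaves the main diagonal unmasked:

```python
import torch

block_size = 4  # toy sequence length for illustration

# Default (diagonal=0): the main diagonal is also marked for masking
mask_default = torch.triu(torch.ones(block_size, block_size))
# diagonal=1: only the entries strictly above the diagonal are marked
mask_causal = torch.triu(torch.ones(block_size, block_size), diagonal=1)

print(mask_default)
# tensor([[1., 1., 1., 1.],
#         [0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.]])

print(mask_causal)
# tensor([[0., 1., 1., 1.],
#         [0., 0., 1., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 0.]])
```

The 1s mark the positions that masked_fill later sets to -inf, so keeping the diagonal at 0 is what lets each token still attend to itself.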
```
keys = embedded_sentence @ W_query
values = embedded_sentence @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
```

Did you mean:

```
keys = embedded_sentence @ W_query.T
values = embedded_sentence @ W_value.T
```
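If it helps, here is a small sketch of the shape bookkeeping behind this question. The sizes (seq_len, d, d_q, d_v) are toy values, not taken from the article, and it assumes the weight matrices are stored with shape (output_dim, input_dim), in which case the transpose is needed:

```python
import torch

torch.manual_seed(123)

# Toy sizes (assumed for illustration, not taken from the article)
seq_len, d = 6, 16
d_q, d_v = 24, 28

embedded_sentence = torch.rand(seq_len, d)
W_query = torch.rand(d_q, d)   # weights stored as (output_dim, input_dim)
W_value = torch.rand(d_v, d)

# With (output_dim, input_dim) weights, the projection needs the transpose;
# identifiers are kept exactly as in the quoted snippet above.
keys = embedded_sentence @ W_query.T    # (6, 16) @ (16, 24) -> (6, 24)
values = embedded_sentence @ W_value.T  # (6, 16) @ (16, 28) -> (6, 28)

print("keys.shape:", keys.shape)      # torch.Size([6, 24])
print("values.shape:", values.shape)  # torch.Size([6, 28])

# Without .T the inner dimensions (16 vs. 24) would not line up,
# and embedded_sentence @ W_query would raise a RuntimeError.
```

If the weight matrices were instead created with shape (input_dim, output_dim), the version without the transpose would be the correct one, so the answer depends on how W_query and W_value are defined earlier in the article.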
Great article. I was looking to broaden my understanding of self-attention and I'm glad I stumbled on this.
Thanks a lot for this article! It cleared up many concepts that I had been struggling with for a long time.
"The masked_fill method then replaces all the elements on and above the diagonal via positive mask values (1s) with -torch.inf, with the results being shown below."
Since you added the argument, it should be *above*, not *on and above*
Thank you for the note!
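To make the wording concrete, here is a brief sketch (again with an assumed toy block_size of 4 and random scores) showing that masked_fill with the diagonal=1 mask replaces only the entries strictly above the diagonal, so the diagonal itself survives the softmax:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
block_size = 4

attn_scores = torch.rand(block_size, block_size)  # toy unnormalized attention scores
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)

# Only positions where the mask is 1 (strictly above the diagonal) become -inf;
# the diagonal entries are left untouched.
masked_scores = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = F.softmax(masked_scores, dim=-1)

print(attn_weights)
# Row i has nonzero weights only for positions 0..i, so each token
# still attends to itself (the diagonal) and to earlier tokens.
```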
I was wondering how backpropagation works with an encoder-decoder transformer. For example, is it necessary to define two optimisers, one for the encoder and one for the decoder? What are the target values of the encoder network when the parameters need to be trained? To summarise: would backpropagation for training the transformer also have to be performed for the encoder network or only for the decoder network, and if so, how?
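In case it is useful to other readers, the usual answer is that the encoder has no separate targets and no separate optimizer: the decoder's loss is backpropagated through the cross-attention layers into the encoder, so a single optimizer over all parameters updates both. Below is a minimal sketch using PyTorch's built-in nn.Transformer; the vocabulary size, model dimensions, and token ids are made-up toy values:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32  # toy sizes for illustration
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)
embedding = nn.Embedding(vocab_size, d_model)
output_proj = nn.Linear(d_model, vocab_size)

# One optimizer over *all* parameters: encoder, decoder, embeddings, projection
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(embedding.parameters()) + list(output_proj.parameters()),
    lr=1e-4,
)

src = torch.randint(0, vocab_size, (1, 8))  # toy source token ids
tgt = torch.randint(0, vocab_size, (1, 6))  # toy target token ids

optimizer.zero_grad()
logits = output_proj(model(embedding(src), embedding(tgt[:, :-1])))
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1)
)
loss.backward()   # gradients flow through the decoder *and* back into the encoder
optimizer.step()  # one step updates encoder and decoder weights together
```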
Well explained!
Well written article. The concept of Queries, Keys, and Values is finally clear to me.
In the diagram "Computing the normalized attention weights α", did you mean
α_2 = softmax(ω_2 / sqrt(d_k))
instead of
α_{2,i} = softmax(ω_{2,i} / sqrt(d_k))?
Since ω_{2,i} / sqrt(d_k) is a scalar, whereas softmax operates on vectors.
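As a quick numerical illustration of that point (the values of ω_2 and d_k below are made up): softmax applied to a single scalar always returns 1.0, so the expression is only meaningful when the softmax is taken over the whole scaled score vector:

```python
import torch
import torch.nn.functional as F

omega_2 = torch.tensor([6.0, 2.0, 4.0])  # toy unnormalized attention scores for query 2
d_k = 2                                   # toy key dimension

# softmax over the whole scaled vector gives a probability distribution
alpha_2 = F.softmax(omega_2 / d_k**0.5, dim=0)
print(alpha_2, alpha_2.sum())  # three weights that sum to 1

# softmax of a single scalar entry is always 1.0, which is why the
# expression only makes sense when applied to the full vector
print(F.softmax(omega_2[1:2] / d_k**0.5, dim=0))  # tensor([1.])
```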
Hi Sebastian,
Thanks for the great article.
I've implemented the original Transformer architecture a couple of times already and every time I learn something new.
Just a quick note:
I think mask = torch.triu(torch.ones(block_size, block_size)) should be mask = torch.triu(torch.ones(block_size, block_size), diagonal=1), otherwise the values on the diagonal get masked as well.
Does anyone else have problems copying the code? Every time I copy-paste the code into a Jupyter NB, for instance
```
sentence = 'Life is short, eat dessert first'

dc = {s:i for i,s
      in enumerate(sorted(sentence.replace(',', '').split()))}
```
it adds an invisible "\u200b" character (a zero-width space) at every empty line in the pasted text, in this example in the empty line between the sentence and dc assignments.
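One possible workaround, in case others hit the same issue: strip the zero-width spaces (U+200B) from the pasted text before running it. The helper name strip_zero_width below is just illustrative:

```python
# Hypothetical helper: remove zero-width spaces (U+200B) from pasted code
def strip_zero_width(text: str) -> str:
    return text.replace("\u200b", "")

# Simulate a paste where the empty line picked up a zero-width space
pasted = (
    "sentence = 'Life is short, eat dessert first'\n"
    "\u200b\n"
    "dc = {s:i for i,s in enumerate(sorted(sentence.replace(',', '').split()))}"
)

exec(strip_zero_width(pasted))  # runs fine once the invisible character is removed
print(dc)  # {'Life': 0, 'dessert': 1, 'eat': 2, 'first': 3, 'is': 4, 'short': 5}
```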