Discussion about this post

Kai Liu

Great article, Sebastian! Thank you for your work!

tzc

Thank you very much for this post!

I have one question about the code in Section 2.5 (gated attention). I thought the gate and the context would have the same shape, but on line 100 the context has shape (b, num_tokens, self.num_heads, self.head_dim) while the gate has shape (b, self.num_heads, num_tokens, self.head_dim).

Is this difference intentional, or might it be a small mistake?
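For illustration, here is a minimal PyTorch sketch of the two layouts described above (the sizes and variable names are illustrative, not taken from the post's code). An element-wise gated product would only line up once the gate's head and token dimensions are transposed to match the context:

```python
import torch

# Illustrative sizes (assumed, not from the post)
b, num_tokens, num_heads, head_dim = 2, 6, 4, 8

# Context as described: (b, num_tokens, num_heads, head_dim)
context = torch.randn(b, num_tokens, num_heads, head_dim)

# Gate as described: (b, num_heads, num_tokens, head_dim)
gate = torch.sigmoid(torch.randn(b, num_heads, num_tokens, head_dim))

# For an element-wise (gated) product, both tensors need the same layout.
# Transposing the gate's head and token dimensions aligns it with the context:
gated_context = context * gate.transpose(1, 2)

print(gated_context.shape)  # torch.Size([2, 6, 4, 8])
```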

Thanks again for sharing!

