Discussion about this post

Jiada Li

I got quite confused by the classification of these attention mechanisms. I had basically assumed hybrid attention meant combinations like MLA + sparse attention, SWA + GQA at different ratios, or Gated Attention plus Delta Gate Attention, but your post states that hybrid attention refers to replacing Transformer blocks with linear attention or state-space modules. That is a kind of hybrid, but it feels more like a new LM architecture design. BTW, I feel like Kimi's Attention Residuals could be part of a future hybrid list, if AttnRes proves compatible with GQA or MLA.
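To make the block-level reading of "hybrid" concrete, here is a minimal sketch: a layer stack where most Transformer blocks are swapped for a linear-attention block and only every fourth layer keeps full softmax attention. HybridStack, LinearAttentionBlock, and the 1-in-4 ratio are illustrative assumptions, not the post's (or any named model's) actual design, and the linear attention is non-causal for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxAttentionBlock(nn.Module):
    """Standard Transformer block: full softmax attention + MLP."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class LinearAttentionBlock(nn.Module):
    """Stand-in for a linear-attention / state-space block: O(n) token
    mixing via a feature map, so softmax never materializes an n x n matrix."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.norm(x)
        # elu(.)+1 feature map so attention factorizes: phi(Q) @ (phi(K)^T V)
        q = F.elu(self.q(h)) + 1
        k = F.elu(self.k(h)) + 1
        v = self.v(h)
        kv = torch.einsum("bnd,bne->bde", k, v)              # (d, d) summary
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        mixed = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return x + self.out(mixed)

class HybridStack(nn.Module):
    """Block-level hybrid: every `ratio`-th layer keeps full softmax
    attention; the rest are replaced by the linear/SSM-style block."""
    def __init__(self, d_model=256, n_heads=4, n_layers=8, ratio=4):
        super().__init__()
        self.layers = nn.ModuleList(
            SoftmaxAttentionBlock(d_model, n_heads) if i % ratio == 0
            else LinearAttentionBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 128, 256)   # (batch, seq, d_model)
print(HybridStack()(x).shape)  # torch.Size([2, 128, 256])
```

Under this reading, hybrids like MLA + sparse attention mix mechanisms within a layer, while the stack above mixes them across layers, which is why the post treats it as an architecture-level choice.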

prcanegaly

Very dense content. Going through this slowly to understand it.

