Love these details, thanks for the write-up
Great summary!
Just a small note: the formatting needs some improvement below "The technical report states that:" and maybe also below "Examples include".
Thanks for sharing ideas
The lightning indexer basically creates a learned attention mask per token, right?
Yes, you can say that. It's basically a learned relevance scorer, and the top-k selector is then what creates the actual mask based on these scores. So together, the indexer + selector act like a learned attention mask.
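Here's a minimal PyTorch sketch of that score-then-select pattern. The shapes and names (`q_idx`, `k_idx`, `k`) are hypothetical, and it ignores details like the indexer's activation and weighting, so treat it as an illustration of the idea rather than DeepSeek's actual implementation:

```python
import torch

torch.manual_seed(0)
seq_len, d, k = 8, 16, 4   # toy sizes; k = tokens each query may attend to

# Indexer queries/keys: in the real model these come from small learned
# projections; random stand-ins here.
q_idx = torch.randn(seq_len, d)
k_idx = torch.randn(seq_len, d)

# Causal mask: -inf above the diagonal, 0 elsewhere
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# 1) Indexer: a learned relevance score for every (query token, key token) pair
scores = q_idx @ k_idx.T + causal            # (seq_len, seq_len)

# 2) Top-k selector: keep only the k highest-scoring keys per query token
topk = scores.topk(k, dim=-1).indices
mask = torch.full_like(scores, float("-inf"))
mask.scatter_(-1, topk, 0.0)                 # 0 where selected, -inf elsewhere
mask = mask + causal                         # early tokens keep strict causality

# `mask` is then added to the main attention logits before softmax, so each
# token attends to at most k selected tokens: a learned attention mask.
```

The main attention itself is unchanged; all the sparsity comes from adding this mask to the logits.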
Have a question about MLA: 'As a side note, the queries are also compressed, but only during training, not inference.' From the DeepSeek source code, the training and inference stages should be the same, so shouldn't the queries in MLA be compressed during inference as well?
Wow, the part about DeepSeek V3.2 being open-weight and challenging proprietary models really resonated; what are your thoughts on its future impact? Your insights are always so valuable!
Fantastic article, Sebastian!
I love your work, buddy, thanks, and looking forward to the next one.