9 Comments
Edward Grundy

Love these details, thanks for the write-up!

Daniel Kleine

Great summary!

Just a small note: the formatting below "The technical report states that:" needs some improvement, and maybe also below "Examples include".

anik roy

Thanks for sharing these ideas.

Tim Dingman

The lightning indexer basically creates a learned attention mask per token, right?

Sebastian Raschka, PhD

Yes, you can say that. It's basically a learned relevance scorer. The top-k selector is what then creates the actual mask based on these scores. So together, the indexer + selector act like a learned attention mask.
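A minimal sketch of that two-stage idea in NumPy. This is illustrative only, not DeepSeek's actual implementation: the real lightning indexer uses small learned projections to score tokens, whereas here a plain query-key dot product stands in for the scorer so the indexer-then-selector structure is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, k = 6, 4, 3  # toy sizes; k = tokens kept per query

queries = rng.normal(size=(seq_len, d))
keys = rng.normal(size=(seq_len, d))

# Stage 1, "indexer": a cheap relevance score for every (query, key) pair.
# (Stand-in scorer; the real indexer is a small learned module.)
scores = queries @ keys.T                               # (seq_len, seq_len)
scores += np.triu(np.full_like(scores, -np.inf), k=1)   # causal: no future keys

# Stage 2, "top-k selector": keep only the k highest-scoring keys per query,
# turning the scores into a boolean attention mask.
topk_idx = np.argsort(scores, axis=-1)[:, -k:]
mask = np.zeros_like(scores, dtype=bool)
np.put_along_axis(mask, topk_idx, True, axis=-1)
mask &= np.isfinite(scores)  # drop padded picks for early tokens

# Each row of `mask` is that token's learned attention mask: full attention
# then only runs over the selected key positions.
print(mask.sum(axis=-1))  # row i attends to min(k, i + 1) keys
```

The point of the split is that the scorer is much cheaper than full attention, so it can scan all keys, while the expensive attention computation only touches the k positions the selector keeps.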

Guorui Zheng

I have a question about MLA: "As a side note, the queries are also compressed, but only during training, not inference." From the DeepSeek source code, the training and inference stages should be the same, so shouldn't the queries in MLA also be compressed during inference?

Rainbow Roxy

Wow, the part about DeepSeek V3.2 being open-weight and challenging proprietary models really resonated. What are your thoughts on its future impact? Your insights are always so valuable!

Ben Dickson

Fantastic article, Sebastian!

Ghandy

I love your work, friend. Thank you, and looking forward to the next one.
