13 Comments
Edward Grundy's avatar

Love these details, thanks for the write up

Maxtuti's avatar

amazing work!

HuangTing's avatar

The order of Figure 7 and Figure 8 is incorrect.

Sebastian Raschka, PhD's avatar

Thanks, will update this when I'm back at my computer.

Sebastian Raschka, PhD's avatar

Should be fixed now. Thanks again!

Daniel Kleine's avatar

Great summary!

Just a small note: the formatting below "The technical report states that:" needs some improvement, and maybe also below "Examples include".

anik roy's avatar

Thanks for sharing these ideas!

Tim Dingman's avatar

The lightning indexer basically creates a learned attention mask per token, right?

Sebastian Raschka, PhD's avatar

Yes, you can say that. It's basically a learned relevance scorer. The top-k selector then creates the actual mask based on those scores. So together, the indexer + selector act like a learned attention mask.
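To make the indexer + selector split concrete, here is a minimal NumPy sketch of just the selector step. It is a toy illustration, not DeepSeek's actual implementation: it assumes the indexer has already produced a (queries × keys) matrix of relevance scores, and shows how a top-k selection over those scores yields a boolean attention mask.

```python
import numpy as np

def topk_attention_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Turn per-token relevance scores into a sparse attention mask.

    scores: (num_queries, num_keys) relevance scores; in the sparse-attention
            setup discussed above, these would come from the learned indexer.
    k:      number of keys each query is allowed to attend to.
    Returns a boolean mask of the same shape, True = attend to that key.
    """
    # Indices of the k highest-scoring keys for each query row.
    topk_idx = np.argpartition(-scores, kth=k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, topk_idx, True, axis=-1)
    return mask

# Toy example: 2 queries, 5 keys, keep the top-2 keys per query.
scores = np.array([[0.1, 0.9, 0.3, 0.8, 0.2],
                   [0.5, 0.4, 0.7, 0.1, 0.6]])
mask = topk_attention_mask(scores, k=2)
# Query 0 attends to keys 1 and 3; query 1 attends to keys 2 and 4.
```

In the actual model the un-selected positions would then be masked out (set to -inf) before the softmax, so each query only attends to its k selected keys.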

Guorui Zheng's avatar

I have a question about MLA: 'As a side note, the queries are also compressed, but only during training, not inference.' From the DeepSeek source code, shouldn't the training and inference stages be the same? So shouldn't the queries in MLA be compressed during inference as well?

Rainbow Roxy's avatar

Wow, the part about DeepSeek V3.2 being open-weight and challenging proprietary models really resonated. What are your thoughts on its future impact? Your insights are always so valuable!

Ben Dickson's avatar

Fantastic article Sebastian!

Ghandy's avatar

I love your work, man. Thanks, and looking forward to the next one.