11 Comments
Jul 30, 2023 · Liked by Sebastian Raschka, PhD

An advantage we found back in the mid-noughties of using gzip+knn over other methods is it can handle concept drift without retraining:

Sarah Jane Delany and Derek Bridge: Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering, in Rosina O. Weber and Michael M. Richter (eds.), Case-Based Reasoning Research and Development (Procs. of the 7th International Conference on Case-Based Reasoning), LNAI 4626, Springer, pp. 314-328, 2007.
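For readers unfamiliar with the gzip+kNN idea mentioned above, here is a minimal illustrative sketch (my own code, not the implementation from any of the cited papers): approximate the normalized compression distance (NCD) with gzip's compressed lengths, then classify a query text by its nearest labeled neighbors.

```python
import gzip

def clen(s: str) -> int:
    # Compressed length as a rough proxy for Kolmogorov complexity.
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: small when x and y share structure,
    # because gzip can reuse substrings of x when compressing the pair.
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

def knn_predict(query: str, train: list[tuple[str, str]], k: int = 1) -> str:
    # Majority label among the k training texts closest to the query under NCD.
    neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```

Since there are no parameters to train, handling concept drift reduces to editing the case base: adding or retiring labeled examples immediately changes the classifier's behavior, with no retraining step.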


Thank you for mentioning my article, I am glad that you liked it. :-)

Aug 3, 2023 · Liked by Sebastian Raschka, PhD

It is worth mentioning the Google Similarity Distance:

https://arxiv.org/pdf/cs/0412098.pdf

Jul 30, 2023 · Liked by Sebastian Raschka, PhD

Agreed, it's an amazing discovery given that gzip has been around for decades. This is like your old BlackBerry performing just as well as a new iPhone 14.

One thought: if this gzip approach works so well on the raw feature space, what about trying it on LLM embedding vectors? What if the vector database folks (Pinecone, etc.) gzipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?


Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana: Towards Parameter-Free Data Mining, KDD 2004.

https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
