11 Comments
Jul 30, 2023 · Liked by Sebastian Raschka, PhD

An advantage we found back in the mid-noughties of using gzip+knn over other methods is it can handle concept drift without retraining:

Sarah Jane Delany and Derek Bridge: Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering, in Rosina O. Weber and Michael M. Richter (eds.), Case-Based Reasoning Research and Development (Procs. of the 7th International Conference on Case-Based Reasoning), LNAI 4626, Springer, pp.314-328, 2007.

author

That's a great point. In practice, you could either add new training cases or keep a rolling dataset of, e.g., the 25k latest examples.
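
A minimal sketch of that rolling-window idea, assuming a gzip-based normalized compression distance in the spirit of the method discussed in the post (the `ncd` helper, window size handling, and tie-breaking are illustrative choices, not from the thread):

```python
import gzip
from collections import deque

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Rolling training set: the oldest examples drop out as new labeled ones
# arrive, so the classifier tracks drift without an explicit retraining step.
window = deque(maxlen=25_000)  # (text, label) pairs

def add_example(text: str, label: str) -> None:
    window.append((text, label))

def classify(text: str, k: int = 3) -> str:
    # Plain kNN over the current window using the gzip-based distance.
    neighbors = sorted(window, key=lambda pair: ncd(text, pair[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)
```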


It seems gzip similarity would not be affected by concept drift only because it cannot support any concept drift at all? A new word for a concept, replacing the old word, will increase the gzip size.

author

I meant more like expanding or replacing the original training set with a new one containing the new concept words. If both the training set and the new instances contain the new word, it should not increase the compressed size.
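
A toy way to check this intuition: gzip pays almost nothing for a word it has already seen in the concatenated pair, so a new concept word only inflates the joint size when one side of the pair lacks it (the example strings below are made up for illustration):

```python
import gzip

def csize(s: str) -> int:
    return len(gzip.compress(s.encode()))

old = "please reset the password for my account"
new = "please reset the passkey for my account"  # new concept word: "passkey"

# Extra bytes gzip needs for the second text, given the first:
print(csize(old + " " + old) - csize(old))  # shared vocabulary -> small increase
print(csize(old + " " + new) - csize(old))  # "passkey" unseen on one side -> larger
```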


Thank you for mentioning my article, I am glad that you liked it. :-)

Aug 3, 2023 · Liked by Sebastian Raschka, PhD

It is worth mentioning the Google Similarity Distance

https://arxiv.org/pdf/cs/0412098.pdf

Jul 30, 2023 · Liked by Sebastian Raschka, PhD

Agree, amazing discovery given that GZIP has been around for decades. This is like your old BlackBerry performing just as well as your new iPhone 14.

What about this: if the GZIP approach works so well on the raw feature space, what about trying it with LLM embedding vectors? What if the vector database folks (Pinecone, etc.) zipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?

author

Thanks for suggesting this, that's a super interesting idea. On second thought, though, I am a bit skeptical: in a real-valued vector space, similar text embeddings will still differ in their digits (after the decimal point), which is where the gzip method may break down. I think L2 distance or cosine distance would probably work much better in this case.

PS: If you are interested in giving this a try, I suggest the `SentenceEncoder("distiluse-base-multilingual-cased-v2")` from the convenient "embetter" library: https://github.com/koaning/embetter
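
For anyone who wants to test that skepticism, here is a rough sketch assuming embetter's scikit-learn-style `SentenceEncoder` API as linked above; the `gzip_distance` helper and the example sentences are illustrative only:

```python
import gzip
import numpy as np
from embetter.text import SentenceEncoder  # from the embetter library linked above

texts = [
    "the cat sat on the mat",
    "a cat is sitting on a mat",
    "stock prices fell sharply today",
]

# Embed with the model suggested above (assumes embetter's sklearn-style API).
enc = SentenceEncoder("distiluse-base-multilingual-cased-v2")
X = enc.fit_transform(texts)

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gzip_distance(a, b):
    # Stringify the float vectors; tiny decimal differences break gzip's
    # substring matching, which is the concern raised above.
    sa = " ".join(f"{v:.4f}" for v in a)
    sb = " ".join(f"{v:.4f}" for v in b)
    ca, cb = len(gzip.compress(sa.encode())), len(gzip.compress(sb.encode()))
    cab = len(gzip.compress((sa + " " + sb).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

print("cosine:", cosine_distance(X[0], X[1]), cosine_distance(X[0], X[2]))
print("gzip  :", gzip_distance(X[0], X[1]), gzip_distance(X[0], X[2]))
```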

Jul 30, 2023 · Liked by Sebastian Raschka, PhD

Thanks for the encouragement and the embetter repo hint. It reminds me of a simple LangChain for scikit-learn folks. It also supports OpenAI & Cohere LLMs and has an interesting bulk-labeling video: https://www.youtube.com/watch?v=gDk7_f3ovIk


This article (and the replies) from July stuck with me. You 'shot me down' with the remark about 'after the decimal point', and you are correct. However, what about binning/normalizing the d dimensions of the embedding vector into '00' to '99' or x'00' to x'FF' strings? Would you bin globally (across all d dimensions) or locally (for each dimension)? Also, forget GZIP... for PCA, use the LoRank(?) tricks to calculate dot products with byte values.
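
A rough sketch of one reading of that binning idea, using local (per-dimension) min-max scaling into two-digit '00'..'99' tokens; the bin count, noise level, and helper names are arbitrary choices for illustration:

```python
import gzip
import numpy as np

def to_tokens(X: np.ndarray) -> list[str]:
    """Bin each dimension locally (per column) into two-digit '00'..'99' tokens."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    bins = np.clip(((X - lo) / (hi - lo + 1e-12) * 99).round().astype(int), 0, 99)
    return ["".join(f"{b:02d}" for b in row) for row in bins]

def ncd(x: str, y: str) -> float:
    cx, cy = len(gzip.compress(x.encode())), len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy "embeddings": ten unrelated vectors plus one near-duplicate of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 64))
near = base[0] + rng.normal(scale=0.005, size=64)

tokens = to_tokens(np.vstack([base, near]))
print(ncd(tokens[0], tokens[-1]))  # vs. its near-duplicate: expect a lower distance
print(ncd(tokens[0], tokens[1]))   # vs. an unrelated vector: expect a higher distance
```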


Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana: Towards Parameter-Free Data Mining, KDD 2004. https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
