Instead of jumping on the latest trend of the week, I wanted to dive into a recent, fascinating application of nearest-neighbor methods in the context of large language models (LLMs) that made big waves in July.
An advantage we found back in the mid-noughties of using gzip+kNN over other methods is that it can handle concept drift without retraining:
Sarah Jane Delany and Derek Bridge: Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering, in Rosina O. Weber and Michael M. Richter (eds.), Case-Based Reasoning Research and Development (Procs. of the 7th International Conference on Case-Based Reasoning), LNAI 4626, Springer, pp.314-328, 2007.
That's a great point. In practice, you could either add new training cases or keep a rolling dataset of, e.g., the latest 25k examples.
It seems gzip similarity would not be affected by concept drift only because it cannot support any concept drift at all? A new word replacing the old word for a concept will increase the gzip size.
I meant more like expanding or replacing the original training set with a new one containing the new concept words. If both the training instances and the new instances contain the new word, that should not increase the compressed size.
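To make this concrete, here is a minimal sketch of a gzip+kNN classifier built on the normalized compression distance (NCD); the toy case base, labels, and `k` value are made up for illustration. Since there is no fitted model, handling drift just means appending freshly labeled cases (which contain the new vocabulary) to the case base, or keeping only a rolling window of the most recent ones:

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    # Compressed length of a string under gzip
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: smaller means more similar
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    # Majority vote over the k nearest cases by NCD
    neighbors = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy case base of (text, label) pairs; entirely made up
case_base = [
    ("cheap meds online, click now", "spam"),
    ("limited offer, claim your prize", "spam"),
    ("meeting moved to 3pm tomorrow", "ham"),
    ("can you review the draft report?", "ham"),
]

print(knn_predict("claim your free prize now", case_base, k=3))

# Concept drift: no retraining, just append newly labeled cases
# (or drop the oldest ones to keep a rolling window).
case_base.append(("your crypto airdrop is waiting", "spam"))
print(knn_predict("exclusive crypto airdrop inside", case_base, k=3))
```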
Thank you for mentioning my article; I am glad that you liked it. :-)
It is worth mentioning the Google Similarity Distance
https://arxiv.org/pdf/cs/0412098.pdf
Agree, amazing discovery given that GZIP has been around for decades. This is like your old BlackBerry performing just as well as your new iPhone 14.
What about this: if the gzip approach works so well on the raw feature space, what about trying it with LLM embedding vectors? What if the vector database folks (Pinecone, etc.) zipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?
Thanks for suggesting it, that's a super interesting idea. On second thought, though, I am a bit skeptical: in a real-valued vector space, similar text embeddings still end up with different numbers (after the decimal point), which is where the gzip method may not work. I think L2 distance or cosine distance would probably be much better in this case (see the sketch below).
PS: If you are interested in giving this a try, I suggest the `SentenceEncoder("distiluse-base-multilingual-cased-v2")` from the convenient "embetter" library: https://github.com/koaning/embetter
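For anyone who wants to try the cosine-distance route, here is a minimal sketch; it loads the suggested `distiluse-base-multilingual-cased-v2` model through `sentence-transformers` directly rather than through embetter, and the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# The multilingual encoder suggested above
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = [
    "The delivery arrived two days late.",
    "My package showed up two days after the promised date.",
    "The soup was delicious.",
]

# Dense float vectors: near-identical meanings still differ in most digits,
# which is why raw gzip over the bytes is unlikely to see them as similar.
embeddings = model.encode(sentences)

# Cosine similarity works directly in the real-valued space
print(cosine_similarity(embeddings[:1], embeddings[1:]))
```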
Thanks for the encouragement and the embetter repo hint. It reminds me of a simple LangChain for scikit-learn folks. It also supports OpenAI & Cohere LLMs, and there is an interesting Bulk Labeling video: https://www.youtube.com/watch?v=gDk7_f3ovIk
This article (and the replies) from July stuck with me. You 'shot me down' with the remark about 'after the decimal point', and you are correct. However, what about binning/normalizing the d dimensions of the embedding vector into '00' to '99' or x'00' to x'FF' strings? Do you bin globally (across all d dimensions) or locally (per dimension)? Also, forget GZIP: for PCA, use the LoRank(?) tricks to calculate dot products with byte values.
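Here is a rough sketch of the per-dimension ('local') binning idea, purely as one reading of what was proposed: each dimension is min-max scaled and quantized to one byte rendered as two hex characters (x'00' to x'FF'), and the resulting strings are compared via gzip-based NCD. The random vectors, the 256 bins, and the NCD helper are all assumptions for illustration; whether gzip finds exploitable structure in such strings is exactly the open question.

```python
import gzip
import numpy as np

def to_hex_strings(vecs: np.ndarray) -> list[str]:
    # Local (per-dimension) min-max scaling, then quantize each value
    # to one byte (x'00'..x'FF') rendered as two hex characters.
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    bins = np.round((vecs - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return ["".join(f"{b:02x}" for b in row) for row in bins]

def ncd(x: str, y: str) -> float:
    # Normalized compression distance on the quantized strings
    c = lambda s: len(gzip.compress(s.encode()))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 512))
near = base + rng.normal(scale=0.01, size=(1, 512))  # a close neighbor
far = rng.normal(size=(1, 512))                      # an unrelated vector

strings = to_hex_strings(np.vstack([base, near, far]))
print("near:", ncd(strings[0], strings[1]))
print("far: ", ncd(strings[0], strings[2]))
```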
Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana: Towards Parameter-Free Data Mining, KDD 2004. https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
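For reference, the dissimilarity measure in that paper, CDM, is even simpler than NCD: CDM(x, y) = C(xy) / (C(x) + C(y)). A minimal sketch with gzip standing in for the compressor (the toy strings are made up):

```python
import gzip

def cdm(x: str, y: str) -> float:
    # Compression-based dissimilarity measure from Keogh et al. (2004):
    # CDM(x, y) = C(xy) / (C(x) + C(y)), here with gzip as the compressor.
    c = lambda s: len(gzip.compress(s.encode()))
    return c(x + y) / (c(x) + c(y))

# Related texts should score lower than unrelated ones
a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox leaps over the lazy cat " * 20
c_ = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
print(cdm(a, b), cdm(a, c_))
```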