Instead of jumping on the latest trend of the week, I wanted to dive into a recent, fascinating application of nearest-neighbor methods in the context of large language models (LLMs) that made big waves in July.
An advantage we found back in the mid-noughties of using gzip+kNN over other methods is that it can handle concept drift without retraining:
Sarah Jane Delany and Derek Bridge: Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering, in Rosina O. Weber and Michael M. Richter (eds.), Case-Based Reasoning Research and Development (Procs. of the 7th International Conference on Case-Based Reasoning), LNAI 4626, Springer, pp.314-328, 2007.
That's a great point. In practice, you could either add new training cases or keep a rolling dataset of, e.g., the latest 25k examples.
It seems gzip similarity would not be affected by concept drift only because it cannot support any concept drift at all? A new word replacing the old word for a concept will increase the gzip size.
I meant more like expanding or replacing the original training set with a new one containing the new concept words. If both the training instances and the new instances contain the new word, that should not increase the compressed size.
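To make this concrete, here is a minimal sketch of a gzip+kNN classifier built on the normalized compression distance (NCD); the toy case base, labels, and `k` value are made up for illustration. Since there is no fitted model, handling drift just means appending freshly labeled cases (which contain the new vocabulary) to the case base, or keeping only a rolling window of the most recent ones:

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    # Compressed length of a string under gzip
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: smaller means more similar
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_predict(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    # Majority vote over the k nearest cases by NCD
    neighbors = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy case base of (text, label) pairs; entirely made up
case_base = [
    ("cheap meds online, click now", "spam"),
    ("limited offer, claim your prize", "spam"),
    ("meeting moved to 3pm tomorrow", "ham"),
    ("can you review the draft report?", "ham"),
]

print(knn_predict("claim your free prize now", case_base, k=3))

# Concept drift: no retraining, just append newly labeled cases
# (or drop the oldest ones to keep a rolling window).
case_base.append(("your crypto airdrop is waiting", "spam"))
print(knn_predict("exclusive crypto airdrop inside", case_base, k=3))
```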
Thank you for mentioning my article; I am glad that you liked it. :-)
It is worth mentioning the Google Similarity Distance
https://arxiv.org/pdf/cs/0412098.pdf
Agree, amazing discovery given that GZIP has been around for decades. This is like your old BlackBerry performing just as well as your new iPhone 14.
What about this: if the gzip approach works so well on the raw feature space, what about trying it with LLM embedding vectors? What if the vector database folks (Pinecone, etc.) zipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?
Thanks for suggesting it, that's a super interesting idea. On second thought, though, I am a bit skeptical: in a real-valued vector space, similar text embeddings still end up with different numbers (after the decimal point), which is where the gzip method may not work. I think L2 distance or cosine distance would probably be much better in this case (see the sketch below).
PS: If you are interested in giving this a try, I suggest the `SentenceEncoder("distiluse-base-multilingual-cased-v2")` from the convenient "embetter" library: https://github.com/koaning/embetter
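For anyone who wants to try the cosine-distance route, here is a minimal sketch; it loads the suggested `distiluse-base-multilingual-cased-v2` model through `sentence-transformers` directly rather than through embetter, and the example sentences are made up:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# The multilingual encoder suggested above
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

sentences = [
    "The delivery arrived two days late.",
    "My package showed up two days after the promised date.",
    "The soup was delicious.",
]

# Dense float vectors: near-identical meanings still differ in most digits,
# which is why raw gzip over the bytes is unlikely to see them as similar.
embeddings = model.encode(sentences)

# Cosine similarity works directly in the real-valued space
print(cosine_similarity(embeddings[:1], embeddings[1:]))
```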
Thanks for the encouragement and the embetter repo hint. It reminds me of a simple LangChain for scikit-learn folks. It also supports OpenAI & Cohere LLMs, and there is an interesting Bulk Labeling video: https://www.youtube.com/watch?v=gDk7_f3ovIk
This article (and the replies) from July stuck with me. You 'shot me down' with the remark about 'after the decimal point', and you are correct. However, what about binning/normalizing the d dimensions of the embedding vector into '00' to '99' or x'00' to x'FF' strings? Do you bin globally (across all d dimensions) or locally (per dimension)? Also, forget GZIP: for PCA, use the LoRank(?) tricks to calculate dot products with byte values.
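Here is a rough sketch of the per-dimension ('local') binning idea, purely as one reading of what was proposed: each dimension is min-max scaled and quantized to one byte rendered as two hex characters (x'00' to x'FF'), and the resulting strings are compared via gzip-based NCD. The random vectors, the 256 bins, and the NCD helper are all assumptions for illustration; whether gzip finds exploitable structure in such strings is exactly the open question.

```python
import gzip
import numpy as np

def to_hex_strings(vecs: np.ndarray) -> list[str]:
    # Local (per-dimension) min-max scaling, then quantize each value
    # to one byte (x'00'..x'FF') rendered as two hex characters.
    lo, hi = vecs.min(axis=0), vecs.max(axis=0)
    bins = np.round((vecs - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
    return ["".join(f"{b:02x}" for b in row) for row in bins]

def ncd(x: str, y: str) -> float:
    # Normalized compression distance on the quantized strings
    c = lambda s: len(gzip.compress(s.encode()))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 512))
near = base + rng.normal(scale=0.01, size=(1, 512))  # a close neighbor
far = rng.normal(size=(1, 512))                      # an unrelated vector

strings = to_hex_strings(np.vstack([base, near, far]))
print("near:", ncd(strings[0], strings[1]))
print("far: ", ncd(strings[0], strings[2]))
```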
Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana: Towards Parameter-Free Data Mining, KDD 2004. https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf
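For reference, the dissimilarity measure in that paper, CDM, is even simpler than NCD: CDM(x, y) = C(xy) / (C(x) + C(y)). A minimal sketch with gzip standing in for the compressor (the toy strings are made up):

```python
import gzip

def cdm(x: str, y: str) -> float:
    # Compression-based dissimilarity measure from Keogh et al. (2004):
    # CDM(x, y) = C(xy) / (C(x) + C(y)), here with gzip as the compressor.
    c = lambda s: len(gzip.compress(s.encode()))
    return c(x + y) / (c(x) + c(y))

# Related texts should score lower than unrelated ones
a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox leaps over the lazy cat " * 20
c_ = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
print(cdm(a, b), cdm(a, c_))
```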