Instead of jumping on the latest trend of the week, I wanted to dive into a recent, fascinating application of nearest-neighbor methods in the context of large language models (LLMs) that made big waves in July. You may know that I like simple yet elegant methods and baselines, and I found this one particularly refreshing given that most current research is about scaling already massive LLMs. Seemingly simple methods such as nearest-neighbor algorithms may not scale to all the problems where LLMs currently excel, but they have a certain beauty to them. It also shows that there are still many opportunities to innovate and make significant contributions based on foundational or "classic" techniques.
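For readers who haven't seen the paper yet, the core idea is to classify a text by its normalized compression distance (NCD) to labeled examples, using gzip as the compressor, and to take a majority vote over the k nearest neighbors. Here is a minimal sketch of that idea; it is a simplified illustration under my own naming and interface, not the authors' exact implementation:

```python
import gzip
from collections import Counter

def clen(text: str) -> int:
    # Length of the gzip-compressed UTF-8 byte string, C(x)
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

def knn_classify(test_text, train_texts, train_labels, k=3):
    # Rank training samples by compression distance to the test sample,
    # then majority-vote over the k nearest labels.
    dists = sorted(
        (ncd(test_text, t), label) for t, label in zip(train_texts, train_labels)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]
```

The appeal is that there is nothing to train: the "model" is just the labeled texts plus a general-purpose compressor.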
An advantage of gzip+kNN over other methods that we found back in the mid-noughties is that it can handle concept drift without retraining:
Sarah Jane Delany and Derek Bridge: Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering, in Rosina O. Weber and Michael M. Richter (eds.), Case-Based Reasoning Research and Development (Proceedings of the 7th International Conference on Case-Based Reasoning), LNAI 4626, Springer, pp. 314-328, 2007.
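To make the "without retraining" point concrete: in an instance-based setup like the sketch above, adapting to drift is just case-base maintenance. A hypothetical toy example, reusing the knn_classify sketch from earlier:

```python
# Adapting to concept drift is just case-base maintenance: append newly
# labeled examples (and optionally retire stale ones). There is no separate
# training step to rerun. (Toy data, for illustration only.)
train_texts = ["cheap meds now", "meeting at 10am", "win a free prize"]
train_labels = ["spam", "ham", "spam"]

# A new spam pattern shows up; one appended case updates the "model" instantly.
train_texts.append("your parcel is held, pay a release fee")
train_labels.append("spam")

# The appended case should now dominate the vote for similar messages
# (compression distances on strings this short are noisy, so treat this
# as a schematic, not a benchmark).
print(knn_classify("parcel held: pay the release fee", train_texts, train_labels, k=1))
```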
Agreed, an amazing discovery given that GZIP has been around for decades. This is like your old BlackBerry performing just as well as your new iPhone 14.
If this GZIP approach works so well on the raw feature space, what about trying it with LLM embedding vectors? What if the vector database folks (Pinecone, etc.) zipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?
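One quick, informal way to sanity-check that idea; this is a sketch assuming embeddings arrive as raw float32 arrays, not a verdict on the indexing proposal itself:

```python
import gzip
import numpy as np

def ncd_bytes(x: bytes, y: bytes) -> float:
    # Same normalized compression distance as before, but over raw bytes
    cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
    return (len(gzip.compress(x + y)) - min(cx, cy)) / max(cx, cy)

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype(np.float32)                  # stand-in for an embedding
b = a + rng.normal(scale=0.01, size=384).astype(np.float32)  # a near-duplicate of it

# Dense float32 bytes look almost random to gzip, so even near-identical
# vectors tend to get an NCD close to 1.0 -- which suggests the text trick
# may not transfer directly to raw embedding bytes.
print(ncd_bytes(a.tobytes(), b.tobytes()))
```

gzip exploits repeated byte substrings, and text has plenty of those; near-duplicate float vectors usually differ in their low mantissa bits, leaving few exact byte repeats for the compressor to find.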
Thank you for mentioning my article; I am glad that you liked it. :-)
It is worth mentioning the Google Similarity Distance by Cilibrasi and Vitányi:
https://arxiv.org/pdf/cs/0412098.pdf
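For reference, the distance defined in that paper, where f(x) is the number of pages containing the term x, f(x, y) the number of pages containing both terms, and N the number of pages indexed:

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```

It is the same normalized-compression-distance idea, with search-engine hit counts standing in for compressed lengths.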
Eamonn Keogh, Stefano Lonardi, and Chotirat Ann Ratanamahatana: Towards Parameter-Free Data Mining, in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), 2004. https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf