© 2025 Sebastian Raschka
Agree, amazing discovery given that GZIP has been around for decades. This is like your old BlackBerry performing just as well as your new iPhone 14.
What about... If this GZIP approach works so well on the raw feature space, what about trying it with LLM embedding vectors? What if the vector database folks (Pinecone, etc.) zipped those embeddings as an initial indexing stage for HNSW? Does that make any sense?
Thanks for the suggestion, that's a super interesting idea. On second thought, I am a bit skeptical: in a real-valued vector space, similar texts produce embeddings whose numbers are close but not identical (they differ after the decimal point), so the serialized digits share few repeated substrings, which is where the gzip method may not work. I think in this case L2 distance or cosine distance would probably be much better.
PS: If you are interested in giving this a try, I suggest the `SentenceEncoder("distiluse-base-multilingual-cased-v2")` from the convenient "embetter" library: https://github.com/koaning/embetter
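For anyone who wants to probe the "after the decimal point" concern before pulling in a real encoder, here is a minimal sketch using only the Python standard library. It is not the method from the article: the random vectors stand in for embeddings, and `ncd`, `serialize`, and the 6-decimal text format are illustrative assumptions. It prints the cosine distance and a gzip-based normalized compression distance for a near-duplicate vector and for an unrelated one, so the two measures can be compared side by side.

```python
import gzip
import math
import random


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance computed with gzip."""
    ca = len(gzip.compress(a))
    cb = len(gzip.compress(b))
    cab = len(gzip.compress(a + b" " + b))
    return (cab - min(ca, cb)) / max(ca, cb)


def cosine_distance(u, v) -> float:
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)


def serialize(vec) -> bytes:
    """Write a vector as space-separated decimals (assumed text format)."""
    return " ".join(f"{x:.6f}" for x in vec).encode("ascii")


# Stand-in "embeddings": u, a slightly perturbed copy v, and an unrelated w.
rng = random.Random(0)
u = [rng.gauss(0, 1) for _ in range(64)]
v = [x + rng.gauss(0, 1e-3) for x in u]
w = [rng.gauss(0, 1) for _ in range(64)]

for name, other in (("near-duplicate", v), ("unrelated", w)):
    cos = cosine_distance(u, other)
    zipped = ncd(serialize(u), serialize(other))
    print(f"{name}: cosine={cos:.6f}  gzip NCD={zipped:.3f}")
```

Swapping the random vectors for real sentence embeddings (for example from the encoder mentioned above) would be the more meaningful experiment; the sketch only shows how to set up the comparison.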
Thanks for the encouragement and the embetter repo hint. It reminds me of a simple LangChain for scikit-learn folks. It also supports OpenAI and Cohere LLMs, and there is an interesting bulk labeling video: https://www.youtube.com/watch?v=gDk7_f3ovIk
This article (and the replies) from July stuck with me. You "shot me down" with the remark about "after the decimal point", and you are correct. However, what about binning/normalizing the d dimensions of the embedding vector into strings, '00' to '99' or x'00' to x'FF'? Would you bin globally (across all d dimensions) or locally (for each dimension)? Also, forget GZIP... For PCA, use the LoRank(?) tricks to calculate dot products with byte values.
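To make the global-versus-local question concrete, here is a rough sketch of both binning schemes in plain Python. The function names (`quantize_global`, `quantize_per_dim`, `to_hex`), the min-max scaling, and the toy vectors are all assumptions chosen for illustration, not something taken from the article or from any vector database.

```python
def _scale(x, lo, hi, levels):
    """Map x from [lo, hi] to an integer code in [0, levels - 1]."""
    if hi <= lo:
        return 0
    code = round((x - lo) * (levels - 1) / (hi - lo))
    return max(0, min(levels - 1, code))


def quantize_global(vectors, levels=256):
    """One min/max range shared by every dimension."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    return [[_scale(x, lo, hi, levels) for x in v] for v in vectors]


def quantize_per_dim(vectors, levels=256):
    """A separate min/max range for each dimension."""
    dims = range(len(vectors[0]))
    los = [min(v[d] for v in vectors) for d in dims]
    his = [max(v[d] for v in vectors) for d in dims]
    return [[_scale(x, los[d], his[d], levels) for d, x in enumerate(v)]
            for v in vectors]


def to_hex(codes):
    """Render byte codes as a string of two-digit hex values, '00' to 'FF'."""
    return "".join(f"{c:02X}" for c in codes)


# Toy data: dimension 0 has a narrow range, dimension 1 a wide one.
vectors = [[0.10, 5.0], [0.12, -3.0], [0.90, 1.0]]

global_codes = quantize_global(vectors)
local_codes = quantize_per_dim(vectors)

for vec, g, p in zip(vectors, global_codes, local_codes):
    print(vec, "global:", to_hex(g), "per-dim:", to_hex(p))
```

With these toy values, global binning lets the wide second dimension set the range, so 0.10 and 0.12 in the first dimension fall into the same bin, while per-dimension binning keeps them apart at the cost of storing a min/max pair per dimension. Whether the resulting strings compress usefully is a separate question this sketch does not answer.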