The Information Theory of Similarity
Nikit Phadke
TL;DR
This work establishes a rigorous information-theoretic foundation for similarity search by proving a precise isomorphism between the REWA framework and Shannon theory. It shows that witness overlap corresponds to mutual information, that bit-precision bounds arise from channel capacity, and that ranking preservation aligns with rate-distortion optimization. The results yield fundamental limits (e.g., $m = O(\Delta^{-2}\log N)$) and provide a complete Shannon–REWA dictionary, unifying Bloom filters, LSH, neural retrieval, and related methods under a single theory. The framework assigns physical units to semantic similarity, reframes retrieval as a communication problem, and offers concrete design principles and optimality results with broad implications for the theory and practice of similarity search.
Abstract
We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research, from Bloom filters to locality-sensitive hashing to neural retrieval, implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA's $O(\Delta^{-2} \log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), search is communication (query transmission over a noisy channel), and retrieval systems face fundamental capacity limits analogous to Shannon's channel coding theorem.
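
To give a feel for the $O(\Delta^{-2} \log N)$ scaling, the following minimal sketch tabulates an illustrative bit budget. It assumes that $\Delta$ is a similarity gap in $(0, 1]$ and $N$ the number of indexed items, which the abstract does not define explicitly. The function name `witness_budget` and the constant `c = 1.0` are hypothetical, since an asymptotic bound does not fix the constant, so only the growth rates are meaningful here, not the absolute numbers.

```python
import math


def witness_budget(delta: float, n_items: int, c: float = 1.0) -> int:
    """Illustrative bit budget m = c * delta**-2 * ln(N).

    `c` is a hypothetical constant chosen for illustration only;
    the O(.) bound does not determine it.
    """
    if not (0.0 < delta <= 1.0):
        raise ValueError("delta must be in (0, 1]")
    if n_items < 2:
        raise ValueError("n_items must be at least 2")
    return math.ceil(c * math.log(n_items) / delta ** 2)


# Tabulate how the budget grows as the gap shrinks and the corpus grows.
for delta in (0.2, 0.1, 0.05):
    for n_items in (10**4, 10**6, 10**8):
        print(f"delta={delta:<5} N={n_items:<10} m~{witness_budget(delta, n_items)}")
```

Under these assumptions, halving $\Delta$ roughly quadruples the budget, while squaring $N$ only doubles it: the gap, not the corpus size, dominates the cost.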
