The Information Theory of Similarity

Nikit Phadke

TL;DR

This work establishes a rigorous information-theoretic foundation for similarity search by proving a precise isomorphism between the REWA framework and Shannon theory. It shows that witness overlap corresponds to mutual information, that bit-precision bounds arise from channel capacity, and that ranking preservation aligns with rate-distortion optimization. The results yield fundamental limits (e.g., $m = O(\Delta^{-2}\log N)$) and provide a complete Shannon–REWA dictionary, unifying Bloom filters, LSH, neural retrieval, and related methods under a single theory. The framework assigns physical units to semantic similarity, reframes retrieval as a communication problem, and offers concrete design principles and optimality results with broad implications for the theory and practice of similarity search.

Abstract

We establish a precise mathematical equivalence between witness-based similarity systems (REWA) and Shannon's information theory. We prove that witness overlap is mutual information, that REWA bit complexity bounds arise from channel capacity limitations, and that ranking-preserving encodings obey rate-distortion constraints. This unification reveals that fifty years of similarity search research -- from Bloom filters to locality-sensitive hashing to neural retrieval -- implicitly developed information theory for relational data. We derive fundamental lower bounds showing that REWA's $O(\Delta^{-2} \log N)$ complexity is optimal: no encoding scheme can preserve similarity rankings with fewer bits. The framework establishes that semantic similarity has physical units (bits of mutual information), search is communication (query transmission over a noisy channel), and retrieval systems face fundamental capacity limits analogous to Shannon's channel coding theorem.
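To give a feel for the $O(\Delta^{-2} \log N)$ scaling, the sketch below evaluates a standard Hoeffding-plus-union-bound estimate of how many independent witness samples suffice to rank $N$ candidates whose true overlaps differ by at least $\Delta$. This is a textbook sufficiency argument, not the paper's proof or its lower bound. The function name `witnesses_needed`, the failure probability parameter, and the constant $2$ are assumptions made for illustration.

```python
import math


def witnesses_needed(gap: float, n_items: int, failure_prob: float = 0.01) -> int:
    """Samples m so that all n_items overlap estimates are within gap/2
    of their true values with probability at least 1 - failure_prob.

    Assumption (not from the paper): each overlap in [0, 1] is estimated as
    the mean of m independent bounded samples, so Hoeffding's inequality gives
        P(|estimate - truth| >= gap / 2) <= 2 * exp(-m * gap**2 / 2).
    A union bound over n_items candidates then requires
        m >= (2 / gap**2) * ln(2 * n_items / failure_prob).
    """
    if not 0.0 < gap <= 1.0:
        raise ValueError("gap must lie in (0, 1]")
    if n_items < 1:
        raise ValueError("n_items must be positive")
    if not 0.0 < failure_prob < 1.0:
        raise ValueError("failure_prob must lie in (0, 1)")
    return math.ceil(2.0 / gap**2 * math.log(2.0 * n_items / failure_prob))


if __name__ == "__main__":
    # m grows quadratically in 1/gap but only logarithmically in n_items.
    for gap in (0.2, 0.1, 0.05):
        for n_items in (10**3, 10**6, 10**9):
            print(f"gap={gap:<5} N={n_items:<11} m={witnesses_needed(gap, n_items)}")
```

Under these assumptions, halving $\Delta$ roughly quadruples the required $m$, whereas growing $N$ by a factor of a thousand adds only a constant increment to the logarithmic term.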

Paper Structure

This paper contains 27 sections, 7 theorems, 45 equations, 1 table.

Key Result

Theorem 4.1

Let $x, y$ be concepts with witness distributions $p_x, p_y$ over universe $\Omega$. Define the joint witness process $(W_x, W_y)$ with joint distribution: where $\kappa(w, w') = \mathbf{1}[w = w']$ for exact matching. Then the mutual information $I(W_x; W_y)$ is a monotonically increasing function of the normalized overlap: for a strictly increasing function $f: [0,1] \to \mathbb{R}_{\geq 0}$ w

Theorems & Definitions (30)

  • Definition 2.1: Entropy
  • Definition 2.2: Mutual Information
  • Definition 2.3: Channel Capacity
  • Definition 2.4: Rate-Distortion Function
  • Definition 2.5: Witness Sets
  • Definition 2.6: Witness Overlap
  • Definition 2.7: REWA Encoding
  • Definition 2.8: Overlap Gap Condition
  • Definition 3.1: Concept as Random Variable
  • Example 3.2: Boolean REWA
  • ...and 20 more