REWA: A General Theory of Witness-Based Similarity

Nikit Phadke

TL;DR

This work introduces REWA, a universal theory of witness-based similarity that reframes diverse similarity methods as functional witness projections over monoids. It proves an $O(\log N)$ encoding with ranking preservation under a $\Delta$-gap, leveraging 4-wise independent hashing and monotone witness aggregation. The authors instantiate REWA across Boolean, Natural, Real, and Tropical domains, connecting Bloom filters/LSH, Count-Min Sketch, Random Fourier Features, and graph-based shortest-paths with a single unifying mechanism. They also develop compositional, multi-channel encodings for hybrid search and discuss practical failure modes and defenses, positioning REWA as a foundation for future multi-modal retrieval systems.

Abstract

We present a universal framework for similarity-preserving encodings that subsumes all discrete, continuous, algebraic, and learned similarity methods under a single theoretical umbrella. By formulating similarity as functional witness projection over monoids, we prove that an encoding complexity of $O\!\left(\frac{1}{\Delta^{2}}\log N\right)$ with ranking preservation holds for arbitrary algebraic structures. This unification reveals that Bloom filters, Locality Sensitive Hashing (LSH), Count-Min sketches, Random Fourier Features, and Transformer attention kernels are instances of the same underlying mechanism. We provide complete proofs with explicit constants under 4-wise independent hashing, handle heavy-tailed witnesses via normalization and clipping, and prove $O(\log N)$ complexity for all major similarity methods from 1970 to 2024. We give explicit constructions for Boolean, Natural, Real, Tropical, and Product monoids, prove tight concentration bounds, and demonstrate compositional properties enabling multi-primitive similarity systems.
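To make the idea of hashing witnesses into a monoid-valued code more concrete, here is a minimal Python sketch. It is not the paper's construction: Definitions 3.2 and 3.3 are not reproduced on this page, so the functions `make_hash`, `encode`, and `boolean_similarity`, the choice of a degree-3 polynomial hash family (a standard way to obtain 4-wise independence before the final reduction modulo the bucket count), and all parameter values are assumptions made for illustration. The same `encode` routine yields a Bloom-filter-like code with the Boolean monoid (OR) and a Count-Min-like code with the Natural monoid (addition).

```python
import operator
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the field size (illustrative choice)


def make_hash(m, rng):
    """Random degree-3 polynomial over GF(PRIME), reduced to m buckets.

    Degree-3 polynomials form a 4-wise independent family over the field;
    the final reduction modulo m makes this only approximately uniform.
    """
    coeffs = [rng.randrange(PRIME) for _ in range(4)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation of the polynomial at x
            acc = (acc * x + c) % PRIME
        return acc % m

    return h


def encode(witnesses, hashes, m, combine, identity):
    """Fold every witness value into one bucket per hash row with the monoid op."""
    code = [[identity] * m for _ in hashes]
    for w, value in witnesses.items():
        for row, h in zip(code, hashes):
            j = h(w)
            row[j] = combine(row[j], value)
    return code


def boolean_similarity(code_a, code_b):
    """Count buckets set in both Boolean codes (an assumed similarity score)."""
    return sum(x and y for ra, rb in zip(code_a, code_b) for x, y in zip(ra, rb))


rng = random.Random(0)
m, rows = 256, 4
hashes = [make_hash(m, rng) for _ in range(rows)]

# Witnesses are integer ids here; a, b share 18 witnesses, c shares none with a.
a = {w: True for w in range(1, 21)}
b = {w: True for w in list(range(1, 19)) + [100, 101]}
c = {w: True for w in range(200, 220)}

enc = lambda s: encode(s, hashes, m, operator.or_, False)  # Boolean monoid
sim_ab = boolean_similarity(enc(a), enc(b))
sim_ac = boolean_similarity(enc(a), enc(c))

# Natural monoid: buckets accumulate counts, as in a Count-Min sketch.
counts = {7: 3, 8: 5, 9: 1}
nat_code = encode(counts, hashes, m, operator.add, 0)
```

In this sketch the heavily overlapping pair `a`, `b` is expected to score well above the disjoint pair `a`, `c`, and each row of `nat_code` sums to the total count, since addition only moves mass between buckets.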

Paper Structure

This paper contains 15 sections, 2 theorems, 3 equations, 2 tables.

Key Result

Theorem 4.1

Let $\mathcal{W}$ be an $(L, \alpha, \beta)$-monotone witness space. Assume the $\Delta$-gap condition holds: for a query $q$, the similarity gap between the true nearest neighbor and any non-neighbor is at least $\Delta$. If the hash functions are 4-wise independent, then for any failure probability [...], where $C_M$ is a monoid-dependent constant (see the table of constants) and $\sigma^2$ is the variance [...]
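
As a rough feel for the $O\!\left(\frac{1}{\Delta^{2}}\log N\right)$ scaling stated in the abstract, consider illustrative values that are not taken from the paper: a gap of $\Delta = 0.1$ and a collection of $N = 10^{6}$ items. Dropping the constant $C_M$, the variance $\sigma^2$, and any failure-probability term, whose exact form is not reproduced on this page, and assuming a natural logarithm:

\[ \frac{1}{\Delta^{2}}\ln N = \frac{1}{(0.1)^{2}} \cdot \ln 10^{6} \approx 100 \times 13.82 \approx 1382. \]

Under these assumptions, halving the gap to $\Delta = 0.05$ quadruples the figure to roughly $5526$, while growing $N$ a thousandfold to $10^{9}$ only raises it by a factor of $1.5$, to roughly $2072$.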

Theorems & Definitions (10)

  • Definition 2.1: Monoid
  • Definition 2.2: Hash Functions
  • Definition 3.1: Functional Witness Space
  • Definition 3.2: REWA Encoding
  • Definition 3.3: REWA Similarity
  • Definition 3.4: Witness Overlap & Monotonicity
  • Remark 3.5: Heavy Hitters
  • Theorem 4.1: Universal REWA Concentration
  • Proof (Sketch)
  • Theorem 6.1: Product Monoid