Improving LSH via Tensorized Random Projection

Bhisham Dev Verma; Rameshwar Pratap

TL;DR

The paper addresses the exponential growth in parameter size that arises when traditional LSH is applied to high-order tensors. Instead of reshaping tensors into vectors, it uses tensorized random projections based on CP and Tensor Train (TT) decompositions. It introduces four methods (CP-E2LSH, TT-E2LSH, CP-SRP, and TT-SRP) that project an input tensor onto low-rank CP or TT projection tensors and then apply discretization (E2LSH) or sign hashing (SRP). It proves asymptotic Gaussianity of the projections to establish LSH guarantees for Euclidean distance and cosine similarity. The key contributions include space complexities of $O(N d R)$ (CP) and $O(N d R^2)$ (TT), time complexities that scale favorably when inputs are provided in CP or TT format, and theoretical guarantees that the collision probabilities match standard LSH behavior. The tensorized approach thus enables scalable near-neighbor search on multidimensional data while preserving the fundamental LSH properties.
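To make the space complexities concrete, the following minimal sketch counts the entries stored for one projection under each scheme. It assumes, as an illustration only, that $N$ is the tensor order, $d$ a common mode dimension, and $R$ the CP or TT rank; the function names are hypothetical and not taken from the paper.

```python
def dense_params(d: int, N: int) -> int:
    """Entries in one projection vector for a reshaped order-N tensor (d per mode)."""
    return d ** N


def cp_params(d: int, N: int, R: int) -> int:
    """Entries in one rank-R CP projection tensor: N factor matrices of shape (d, R)."""
    return N * d * R


def tt_params(d: int, N: int, R: int) -> int:
    """Entries in one rank-R TT projection tensor.

    Assumes N >= 2: two boundary cores of shape (d, R) and
    N - 2 interior cores of shape (R, d, R).
    """
    if N < 2:
        raise ValueError("TT count here assumes an order of at least 2")
    return 2 * d * R + (N - 2) * d * R * R


if __name__ == "__main__":
    d, N, R = 10, 6, 4  # illustrative sizes, not from the paper
    print("reshaped vector:", dense_params(d, N))
    print("CP projection:  ", cp_params(d, N, R))
    print("TT projection:  ", tt_params(d, N, R))
```

The reshaped count grows as $d^N$, while the CP and TT counts grow linearly in $N$ for a fixed rank.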

Abstract

Locality sensitive hashing (LSH) is a fundamental algorithmic toolkit for approximate nearest neighbour search that has been used extensively in large scale data processing applications such as near duplicate detection, nearest neighbour search, and clustering. In this work, we propose faster and more space efficient locality sensitive hash functions for Euclidean distance and cosine similarity on tensor data. The naive approach to LSH for tensor data first reshapes the tensor into a vector and then applies an existing LSH method for vector data, such as E2LSH or SRP. However, this approach becomes impractical for higher order tensors because the size of the reshaped vector is exponential in the order of the tensor, and consequently the size of the LSH parameters also grows exponentially. To address this problem, we suggest two methods for each similarity measure: CP-E2LSH and TT-E2LSH for Euclidean distance, and CP-SRP and TT-SRP for cosine similarity, building on the CP and tensor train (TT) decomposition techniques. Our approaches are space efficient and can be applied efficiently to low rank CP or TT tensors. We provide a rigorous theoretical analysis of the correctness and efficacy of our proposals.
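The idea of hashing through a CP-structured projection can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the projection tensor is taken to be $P = R^{-1/2} \sum_{r} a_r^{(1)} \circ \cdots \circ a_r^{(N)}$ with i.i.d. $\pm 1$ factor entries, the Euclidean hash is the usual E2LSH form $\lfloor (\langle P, X \rangle + b)/w \rfloor$, and the cosine hash is the sign of $\langle P, X \rangle$. All function names and parameter values are hypothetical.

```python
import numpy as np


def sample_cp_rademacher(shape, rank, rng):
    """Draw one (d_n, rank) factor matrix of i.i.d. +/-1 entries per mode."""
    return [rng.choice([-1.0, 1.0], size=(d, rank)) for d in shape]


def cp_project(X, factors):
    """Inner product <P, X> for the CP-structured P described above.

    P is never materialised: each rank-one term is contracted with X
    one mode at a time.
    """
    rank = factors[0].shape[1]
    total = 0.0
    for r in range(rank):
        y = X
        for A in factors:
            # Contract the current leading mode of y with column r of A.
            y = np.tensordot(A[:, r], y, axes=(0, 0))
        total += float(y)
    return total / np.sqrt(rank)


def cp_e2lsh(X, factors, b, w):
    """E2LSH-style bucket index: floor((<P, X> + b) / w)."""
    return int(np.floor((cp_project(X, factors) + b) / w))


def cp_srp(X, factors):
    """SRP-style one-bit hash: 1 if <P, X> is positive, else 0."""
    return int(cp_project(X, factors) > 0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 5, 6))       # toy order-3 input tensor
    factors = sample_cp_rademacher(X.shape, rank=3, rng=rng)
    w = 4.0                                  # bucket width (assumed value)
    b = rng.uniform(0.0, w)                  # random offset in [0, w)
    print("E2LSH bucket:", cp_e2lsh(X, factors, b, w))
    print("SRP bit:     ", cp_srp(X, factors))
```

The sketch contracts a dense input for simplicity; the efficiency gains described in the abstract apply when the input itself is given in low rank CP or TT form, which this example does not exploit.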

Paper Structure (19 sections, 11 theorems, 90 equations, 2 tables)

This paper contains 19 sections, 11 theorems, 90 equations, 2 tables.

Introduction
Related Work
LSH for Euclidean distance:
LSH for Cosine Similarity:
Background
Locality Sensitive Hashing (LSH):
Sign Random Projection (SRP/SimHash)
E2LSH [datar2004locality]
Tensors
Tensorized random projection:
Central Limit Theorems
Analysis
Tensorized E2LSH
CP-E2LSH
TT-E2LSH
...and 4 more sections

Key Result

Theorem 1

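[janson1988normal] Let $\{X_{1}, \ldots, X_{d}\}$ be a family of bounded random variables, i.e. $|X_{i}| \leq A$. Suppose that $\Gamma_{d}$ is a dependency graph for this family and let $M$ be the maximal degree of $\Gamma_{d}$ (if $\Gamma_{d}$ has no edges, we set $M=1$). Let $S_{d} = \sum_{i=1}^{d} X_{i}$ and $\sigma_{d}^{2} = \mathrm{Var}(S_{d})$. If there exists an integer $\alpha$ such that $\left(\frac{d}{M}\right)^{1/\alpha} \frac{M A}{\sigma_{d}} \to 0$ as $d \to \infty$, then $\frac{S_{d} - \mathbb{E}[S_{d}]}{\sigma_{d}} \overset{\mathcal{D}}{\to} \mathcal{N}(0,1)$, where $\overset{\mathcal{D}}{\to}$ indicates the convergence in distribution.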
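The paper uses a central limit theorem of this kind to argue that tensorized projections are asymptotically Gaussian. The following sketch is only an empirical sanity check, not part of the proof: it samples many CP-Rademacher projections of a fixed order-3 tensor (the same assumed construction $P = R^{-1/2} \sum_r a_r^{(1)} \circ a_r^{(2)} \circ a_r^{(3)}$ with $\pm 1$ entries) and reports the mean and variance after dividing by $\|X\|_F$. The sizes, rank, and function names are illustrative assumptions.

```python
import numpy as np


def cp_projection_samples(X, rank, trials, rng):
    """Draw `trials` independent CP-Rademacher projections of an order-3 tensor X."""
    d1, d2, d3 = X.shape
    A = rng.choice([-1.0, 1.0], size=(trials, d1, rank))
    B = rng.choice([-1.0, 1.0], size=(trials, d2, rank))
    C = rng.choice([-1.0, 1.0], size=(trials, d3, rank))
    # y_t = R^{-1/2} * sum_{i,j,k,r} X[i,j,k] * A[t,i,r] * B[t,j,r] * C[t,k,r],
    # evaluated one mode at a time to keep intermediates small.
    XA = np.einsum("ijk,tir->tjkr", X, A)
    XAB = np.einsum("tjkr,tjr->tkr", XA, B)
    y = np.einsum("tkr,tkr->t", XAB, C)
    return y / np.sqrt(rank)


def standardized_moments(y, scale):
    """Mean and variance of the samples after dividing by `scale`."""
    z = y / scale
    return float(z.mean()), float(z.var())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 6, 6))
    y = cp_projection_samples(X, rank=4, trials=20000, rng=rng)
    mean, var = standardized_moments(y, np.linalg.norm(X))
    # For a projection close to N(0, ||X||_F^2), these should be near 0 and 1.
    print("standardized mean:", mean)
    print("standardized variance:", var)
```

Under this construction the projection has mean zero and variance exactly $\|X\|_F^2$ for any size; the theorem concerns the shape of the distribution as the dimension grows, which a histogram of `y` would illustrate further.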

Theorems & Definitions (40)

Definition 1
Definition 2
Definition 3
Definition 4: CP Decomposition [kolda2009tensor]
Definition 5: TT Decomposition [oseledets2011tensor]
Definition 6: CP-Rademacher Distributed Tensor [rakhshan2021rademacher]
Definition 7: TT-Rademacher Distributed Tensor [rakhshan2021rademacher]
Definition 8: CP Rademacher Random Projection [rakhshan2020tensorized, rakhshan2021rademacher]
Definition 9: TT Rademacher Random Projection [rakhshan2020tensorized, rakhshan2021rademacher]
Remark 1
...and 30 more
