Improving LSH via Tensorized Random Projection

Bhisham Dev Verma, Rameshwar Pratap

TL;DR

Applying traditional LSH to a high-order tensor requires reshaping it into a vector whose length, and hence the size of the hash parameters, grows exponentially with the tensor order. The paper avoids this reshaping by using tensorized random projections based on CP and Tensor Train (TT) decompositions. It introduces four methods (CP-E2LSH, TT-E2LSH, CP-SRP, and TT-SRP) that project the input tensor onto a low-rank CP or TT projection tensor and then apply discretization (E2LSH) or sign hashing (SRP). It proves that the projections are asymptotically Gaussian, which establishes LSH guarantees for Euclidean distance and cosine similarity. The space complexities are $O(N d R)$ (CP) and $O(N d R^2)$ (TT), where $N$ is the tensor order, $d$ the mode dimension, and $R$ the rank; the time complexities scale favorably when inputs are given in CP or TT format, and the collision probabilities match standard LSH behavior. The approach makes near-neighbor search on large tensor-structured data practical while preserving the fundamental LSH properties.
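The CP-based hashing pipeline summarised above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes the projection tensor is $\frac{1}{\sqrt{R}} \sum_{r=1}^{R} a_r^{(1)} \circ \cdots \circ a_r^{(N)}$ with i.i.d. $\pm 1$ factor entries, that SRP maps a positive projection to 1 and anything else to 0, and that E2LSH uses an offset `b` and bucket width `w`. The function names (`cp_rademacher_factors`, `cp_project`, `cp_srp_hash`, `cp_e2lsh_hash`) are hypothetical.

```python
import numpy as np

def cp_rademacher_factors(shape, rank, rng):
    """Draw one (d_n x R) matrix of +/-1 entries per tensor mode (assumed construction)."""
    return [rng.choice([-1.0, 1.0], size=(d, rank)) for d in shape]

def cp_project(x, factors):
    """Inner product <P, x> for the rank-R CP projection tensor P,
    computed mode by mode so that P itself is never materialised."""
    rank = factors[0].shape[1]
    # Contract mode 1: result has shape (R, d_2, ..., d_N).
    acc = np.tensordot(factors[0].T, x, axes=([1], [0]))
    for a in factors[1:]:
        # For each rank index r, contract the leading data mode with column r of a.
        acc = np.einsum('rj...,jr->r...', acc, a)
    # acc now has shape (R,); sum the rank-one contributions and rescale.
    return acc.sum() / np.sqrt(rank)

def cp_srp_hash(x, factors):
    """Sign hash (assumed convention: 1 if the projection is positive, else 0)."""
    return 1 if cp_project(x, factors) > 0 else 0

def cp_e2lsh_hash(x, factors, b, w):
    """Discretised hash: floor((<P, x> + b) / w), with b assumed drawn from [0, w)."""
    return int(np.floor((cp_project(x, factors) + b) / w))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape, rank, w = (8, 8, 8), 4, 4.0
    factors = cp_rademacher_factors(shape, rank, rng)
    b = rng.uniform(0.0, w)
    x = rng.standard_normal(shape)
    print(cp_srp_hash(x, factors), cp_e2lsh_hash(x, factors, b, w))
```

Only the factor matrices are stored, which is where the $O(N d R)$ space figure comes from; the dense input used here is for illustration, and the favorable time complexities mentioned above apply when the input itself is given in CP or TT format.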

Abstract

Locality sensitive hashing (LSH) is a fundamental algorithmic toolkit used by data scientists for approximate nearest neighbour search, and it has been used extensively in many large scale data processing applications such as near duplicate detection, nearest neighbour search, clustering, etc. In this work, we propose faster and space efficient locality sensitive hash functions for Euclidean distance and cosine similarity for tensor data. Typically, the naive approach for obtaining LSH for tensor data involves first reshaping the tensor into a vector, followed by applying existing LSH methods for vector data, E2LSH and SRP. However, this approach becomes impractical for higher order tensors because the size of the reshaped vector becomes exponential in the order of the tensor. Consequently, the size of LSH parameters increases exponentially. To address this problem, we suggest two methods each for Euclidean distance and cosine similarity, namely CP-E2LSH and TT-E2LSH for the former and CP-SRP and TT-SRP for the latter, building on CP and tensor train (TT) decomposition techniques. Our approaches are space efficient and can be efficiently applied to low rank CP or TT tensors. We provide a rigorous theoretical analysis of the correctness and efficacy of our proposal.
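To make the exponential growth concrete, the short sketch below compares rough parameter counts per hash function. It assumes every mode has the same dimension `d`, and it treats the $O(N d R)$ and $O(N d R^2)$ bounds from the TL;DR as counts with constant 1, so the CP and TT figures are order-of-magnitude estimates rather than exact sizes. The function names are hypothetical.

```python
def naive_params(d, order):
    """Entries in one projection vector after reshaping an order-N tensor: d^N."""
    return d ** order

def cp_params(d, order, rank):
    """Rough count for a rank-R CP projection tensor: N * d * R."""
    return order * d * rank

def tt_params(d, order, rank):
    """Rough count for a rank-R TT projection tensor: N * d * R^2."""
    return order * d * rank * rank

if __name__ == "__main__":
    d, order, rank = 10, 5, 4
    print(naive_params(d, order))     # 100000
    print(cp_params(d, order, rank))  # 200
    print(tt_params(d, order, rank))  # 800
```

Raising the order from 5 to 6 multiplies the naive count by `d`, while the CP and TT counts grow only by a factor of 6/5.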

Paper Structure (19 sections, 11 theorems, 90 equations, 2 tables)

Key Result

Theorem 1

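[janson1988normal] Let $\{X_{1}, \ldots, X_{d}\}$ be a family of bounded random variables, i.e. $|X_{i}| \leq A$. Suppose that $\Gamma_{d}$ is a dependency graph for this family and let $M$ be the maximal degree of $\Gamma_{d}$ (if $\Gamma_{d}$ has no edges, we set $M=1$). Let $S_{d} = \sum_{i=1}^{d} X_{i}$ and $\sigma_{d}^{2} = \mathrm{Var}(S_{d})$. If there exists an integer $\alpha$ such that $(d/M)^{1/\alpha} M A / \sigma_{d} \to 0$ as $d \to \infty$, then $(S_{d} - \mathbb{E}[S_{d}]) / \sigma_{d} \overset{\mathcal{D}}{\to} \mathcal{N}(0, 1)$, where $\overset{\mathcal{D}}{\to}$ indicates the convergence in distribution.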

Theorems & Definitions (40)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4: CP Decomposition [kolda2009tensor]
  • Definition 5: TT Decomposition [oseledets2011tensor]
  • Definition 6: CP-Rademacher Distributed Tensor [rakhshan2021rademacher]
  • Definition 7: TT-Rademacher Distributed Tensor [rakhshan2021rademacher]
  • Definition 8: CP Rademacher Random Projection [rakhshan2020tensorized, rakhshan2021rademacher]
  • Definition 9: TT Rademacher Random Projection [rakhshan2020tensorized, rakhshan2021rademacher]
  • Remark 1
  • ...and 30 more