Improving LSH via Tensorized Random Projection
Bhisham Dev Verma, Rameshwar Pratap
TL;DR
Applying traditional LSH to a high-order tensor requires reshaping it into a vector whose size, and hence the size of the hash parameters, grows exponentially with the tensor order. The paper avoids this reshaping by using tensorized random projections based on CP and Tensor Train (TT) decompositions. It introduces four methods (CP-E2LSH, TT-E2LSH, CP-SRP, and TT-SRP) that project the input tensor onto low-rank CP or TT projection tensors and then apply discretization (E2LSH) or sign hashing (SRP). It proves that the projections are asymptotically Gaussian, which establishes LSH guarantees for Euclidean distance and cosine similarity. The key contributions are space complexities of $O(N d R)$ (CP) and $O(N d R^2)$ (TT), time complexities that scale favorably when inputs are provided in CP or TT format, and rigorous guarantees that the collision probabilities match standard LSH behavior. The tensorized approach makes near-neighbor search on multidimensional data scalable while preserving the fundamental LSH properties, which is relevant to mining and processing large tensor-structured datasets.
Abstract
Locality sensitive hashing (LSH) is a fundamental algorithmic toolkit for approximate nearest neighbour search that has been used extensively in large scale data processing applications such as near duplicate detection, nearest neighbour search, and clustering. In this work, we propose faster and more space efficient locality sensitive hash functions for Euclidean distance and cosine similarity on tensor data. The naive approach to LSH for tensor data first reshapes the tensor into a vector and then applies existing LSH methods for vector data, such as E2LSH and SRP. However, this approach becomes impractical for higher order tensors because the size of the reshaped vector is exponential in the order of the tensor. Consequently, the size of the LSH parameters also increases exponentially. To address this problem, we propose two methods for each similarity measure: CP-E2LSH and TT-E2LSH for Euclidean distance, and CP-SRP and TT-SRP for cosine similarity, building on the CP and tensor train (TT) decomposition techniques. Our approaches are space efficient and can be applied efficiently to low rank CP or TT tensors. We provide a rigorous theoretical analysis of the correctness and efficacy of our proposals.
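
The idea of hashing without reshaping can be illustrated with a minimal NumPy sketch of the CP variants. This is not the authors' implementation. It assumes a rank-$R$ CP projection tensor $P = R^{-1/2} \sum_{r=1}^{R} a_r^{(1)} \circ \cdots \circ a_r^{(N)}$ with i.i.d. Rademacher factor entries, which the abstract does not specify, and the helper names (`make_cp_factors`, `cp_project`, `cp_e2lsh`, `cp_srp`) and the parameters `b` and `w` are hypothetical. Only the $N$ factor matrices of size $d \times R$ are stored, consistent with the $O(N d R)$ space bound.

```python
import numpy as np

def make_cp_factors(dims, rank, rng):
    """Draw one (d_n x rank) factor matrix per mode (Rademacher entries: an assumption)."""
    return [rng.choice([-1.0, 1.0], size=(d, rank)) for d in dims]

def cp_project(x, factors):
    """Compute <P, x> for the CP projection tensor P without ever forming P."""
    rank = factors[0].shape[1]
    # Contract mode 1: result has shape (R, d_2, ..., d_N).
    t = np.tensordot(factors[0].T, x, axes=([1], [0]))
    # Contract each remaining mode separately for every rank-one component r.
    for f in factors[1:]:
        t = np.einsum('rd...,dr->r...', t, f)
    return t.sum() / np.sqrt(rank)

def cp_e2lsh(x, factors, b, w):
    """E2LSH-style hash: shift the projection by b in [0, w), then bucket with width w."""
    return int(np.floor((cp_project(x, factors) + b) / w))

def cp_srp(x, factors):
    """SRP-style hash: keep only the sign of the projection."""
    return int(cp_project(x, factors) >= 0)

rng = np.random.default_rng(0)
dims, rank = (3, 4, 5), 6
factors = make_cp_factors(dims, rank, rng)
x = rng.standard_normal(dims)
print(cp_e2lsh(x, factors, b=rng.uniform(0.0, 4.0), w=4.0), cp_srp(x, factors))
```

The sketch takes a dense input, so the contraction still touches every entry of `x`. The favorable running times mentioned in the abstract apply when the input itself is given in CP or TT format, which this sketch does not cover.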
