Explicit Min-wise Hash Families with Optimal Size

Xue Chen, Shengtang Huang, Xin Li

TL;DR

The paper tackles the problem of constructing explicit min-wise hash families with seed length that is optimal up to constants while achieving sub-constant (almost polynomially small) error. It develops two main PRG-based strategies: (i) an extractor-enhanced approach that makes a designated bucket’s output perfectly uniform, and (ii) a direct-sum approach combining limited-independence with a PRG for combinatorial rectangles, enabling multiplicative error guarantees. The authors prove new results: an explicit min-wise hash with seed length $O(\log N)$ and error $2^{-O(\log N/\log\log N)}$, and a $k$-min-wise hash with seed length $O(k\log N)$ and the same sub-constant error for $k=\operatorname{polylog}(N)$. These constructions close the gap to optimal seed length in the sub-constant error regime and have direct implications for space complexity in streaming and similarity-estimation tasks where $k=\operatorname{polylog}(N)$ and $\delta=2^{-O\left(\frac{\log N}{\log\log N}\right)}$. Overall, the work advances the derandomization of min-wise hashing by extending the Nisan-Zuckerman PRG framework to multiplicative-error targets and demonstrates new ways to fuse extractors and PRGs to fool conditional events in combinatorial rectangles.

Abstract

We study explicit constructions of min-wise hash families and their extension to $k$-min-wise hash families. Informally, a min-wise hash family guarantees that for any fixed subset $X\subseteq[N]$, every element in $X$ has an equal chance to have the smallest value among all elements in $X$; a $k$-min-wise hash family guarantees this for every subset of size $k$ in $X$. Min-wise hash is widely used in many areas of computer science such as sketching, web page detection, and $\ell_0$ sampling. The classical works by Indyk and Pătraşcu and Thorup have shown $\Theta(\log(1/\delta))$-wise independent families give min-wise hash of multiplicative (relative) error $\delta$, resulting in a construction with $\Theta(\log(1/\delta)\log N)$ random bits. Based on a reduction from pseudorandom generators for combinatorial rectangles by Saks, Srinivasan, Zhou and Zuckerman, Gopalan and Yehudayoff improved the number of bits to $O(\log N\log\log N)$ for polynomially small errors $\delta$. However, no construction with $O(\log N)$ bits (polynomial size family) and sub-constant error was known before. In this work, we continue and extend the study of constructing ($k$-)min-wise hash families from pseudorandomness for combinatorial rectangles and read-once branching programs. Our main result gives the first explicit min-wise hash families that use an optimal (up to constant) number of random bits and achieve a sub-constant (in fact, almost polynomially small) error, specifically, an explicit family of $k$-min-wise hash with $O(k\log N)$ bits and $2^{-O(\log N/\log\log N)}$ error. This improves all previous results for any $k=\log^{O(1)}N$ under $O(k \log N)$ bits. Our main techniques involve several new ideas to adapt the classical Nisan-Zuckerman pseudorandom generator to fool min-wise hashing with a multiplicative error.
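To make the informal min-wise property concrete, the following is a minimal sketch of the classical limited-independence approach mentioned in the abstract, not the paper's Nisan-Zuckerman-based construction. It draws random polynomials over a prime field and measures how often each element of a fixed subset receives the smallest hash value. The field size `P`, the helper names `sample_hash` and `min_frequencies`, and the chosen parameters are all assumptions made for illustration.

```python
import random
from collections import Counter

# Assumed field size for this sketch: the Mersenne prime 2^31 - 1.
# All hashed elements must be smaller than P.
P = 2_147_483_647


def sample_hash(degree, rng):
    """Draw a uniformly random polynomial of degree at most `degree` over GF(P).

    Such a polynomial is (degree + 1)-wise independent on distinct inputs.
    """
    coeffs = [rng.randrange(P) for _ in range(degree + 1)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner's rule, leading coefficient first
            acc = (acc * x + c) % P
        return acc

    return h


def min_frequencies(subset, degree, trials, seed=0):
    """Estimate, for each x in `subset`, the probability that h(x) is the minimum."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        h = sample_hash(degree, rng)
        counts[min(subset, key=h)] += 1
    return {x: counts[x] / trials for x in subset}


if __name__ == "__main__":
    X = [3, 17, 42, 99, 1000]
    # Each frequency should be close to 1/|X| for a min-wise family.
    for x, f in min_frequencies(X, degree=7, trials=20000, seed=1).items():
        print(x, round(f, 3))
```

In this toy setting the independence (8-wise) exceeds the subset size, so the minimum is uniform up to rare ties. The regime the abstract cares about is large subsets, where the independence needed grows with $\log(1/\delta)$ and drives up the seed length.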

Paper Structure

This paper contains 35 sections, 18 theorems, 81 equations.

Key Result

Theorem 1.2

Given any $N$, there exists an explicit family of min-wise hash of $O(\log N)$ bits and (multiplicative) error $\delta=2^{-O\left(\frac{\log N}{\log \log N}\right)}$.
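As a rough guide to what these parameters mean, the sketch below writes out the usual multiplicative-error condition for a min-wise family $\mathcal{H}$ and compares the seed length of Theorem 1.2 with the limited-independence bound quoted in the abstract. The form of the condition is the standard one and is assumed here; the paper's Definition 1.1 may differ in details. The constant $c$ is a placeholder for the constant hidden in the $O(\cdot)$.

```latex
% Standard multiplicative-error condition (assumed form):
% for every X \subseteq [N] and every y \in [N] \setminus X,
\Pr_{h \in \mathcal{H}}\!\left[\, h(y) < \min_{x \in X} h(x) \,\right]
  = \frac{1 \pm \delta}{|X| + 1}.

% Parameters of Theorem 1.2, writing \delta = 2^{-c \log N / \log\log N}
% for some constant c > 0:
\log(1/\delta) = \frac{c \log N}{\log\log N}
\quad\Longrightarrow\quad
\underbrace{\Theta\!\big(\log(1/\delta)\,\log N\big)}_{\text{limited independence}}
  = \Theta\!\left(\frac{\log^{2} N}{\log\log N}\right)
\;\gg\;
\underbrace{O(\log N)}_{\text{Theorem 1.2}}.
```

At this error level, a seed of $O(\log N)$ bits corresponds to a family of size $N^{O(1)}$, whereas $\Theta(\log(1/\delta))$-wise independence would need a super-polynomial family.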

Theorems & Definitions (36)

  • Definition 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Definition 2.1: Read-once branching program
  • Definition 2.2: Pseudorandom generator
  • Definition 2.3
  • Lemma 2.4
  • Definition 2.5
  • Lemma 2.6
  • Theorem 2.7
  • ...and 26 more