Explicit Min-wise Hash Families with Optimal Size

Xue Chen; Shengtang Huang; Xin Li

TL;DR

The paper constructs explicit min-wise hash families whose seed length is optimal up to constant factors while achieving sub-constant (almost polynomially small) error. It develops two main PRG-based strategies: (i) an extractor-enhanced approach that makes a designated bucket's output perfectly uniform, and (ii) a direct-sum approach that combines limited independence with a PRG for combinatorial rectangles, enabling multiplicative error guarantees. The authors prove two new results: an explicit min-wise hash family with seed length $O(\log N)$ and error $2^{-O(\log N/\log\log N)}$, and a $k$-min-wise hash family with seed length $O(k\log N)$ and the same sub-constant error for $k=\mathrm{polylog}(N)$. These constructions close the gap to optimal seed length in the sub-constant error regime and have direct implications for space complexity in streaming and similarity-estimation tasks where $k=\mathrm{polylog}(N)$ and $\delta=2^{-O(\frac{\log N}{\log\log N})}$. Overall, the work advances the derandomization of min-wise hashing by extending the Nisan-Zuckerman PRG framework to multiplicative-error targets, and it demonstrates new ways to fuse extractors and PRGs to fool conditional events in combinatorial rectangles.
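To make the phrases "optimal size" and "almost polynomially small" concrete, the stated bounds can be rewritten in terms of $N$. This is only a restatement of the bounds above, where $\mathcal{H}$ (a symbol introduced here for illustration) denotes the hash family indexed by the seed:

$$
|\mathcal{H}| = 2^{O(\log N)} = N^{O(1)}, \qquad \delta = 2^{-O(\log N/\log\log N)} = N^{-O(1/\log\log N)}.
$$

So the family has polynomial size, and the error exponent falls short of a constant only by a $\log\log N$ factor.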

Abstract

We study explicit constructions of min-wise hash families and their extension to $k$-min-wise hash families. Informally, a min-wise hash family guarantees that for any fixed subset $X\subseteq[N]$, every element in $X$ has an equal chance to have the smallest value among all elements in $X$; a $k$-min-wise hash family guarantees this for every subset of size $k$ in $X$. Min-wise hash is widely used in many areas of computer science such as sketching, web page detection, and $\ell_0$ sampling. The classical works by Indyk and by Pătraşcu and Thorup showed that $\Theta(\log(1/\delta))$-wise independent families give min-wise hash of multiplicative (relative) error $\delta$, resulting in a construction with $\Theta(\log(1/\delta)\log N)$ random bits. Based on a reduction from pseudorandom generators for combinatorial rectangles by Saks, Srinivasan, Zhou and Zuckerman, Gopalan and Yehudayoff improved the number of bits to $O(\log N\log\log N)$ for polynomially small errors $\delta$. However, no construction with $O(\log N)$ bits (a polynomial-size family) and sub-constant error was known before. In this work, we continue and extend the study of constructing ($k$-)min-wise hash families from pseudorandomness for combinatorial rectangles and read-once branching programs. Our main result gives the first explicit min-wise hash families that use an optimal (up to a constant factor) number of random bits and achieve a sub-constant (in fact, almost polynomially small) error; specifically, we give an explicit family of $k$-min-wise hash with $O(k\log N)$ bits and $2^{-O(\log N/\log\log N)}$ error. This improves all previous results for any $k=\log^{O(1)}N$ under $O(k \log N)$ bits. Our main techniques involve several new ideas to adapt the classical Nisan-Zuckerman pseudorandom generator to fool min-wise hashing with a multiplicative error.
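The informal definition can be checked exhaustively on a toy family. The sketch below assumes the usual formalization, in which a family has multiplicative error $\delta$ on a set $X$ if every $y\in X$ is the strict minimum with probability $(1\pm\delta)/|X|$ over the choice of seed. The linear family $h_{a,b}(x)=(ax+b) \bmod p$ and the name `minwise_error` are illustrative assumptions, not the paper's construction; the family is merely a small pairwise-independent example whose seeds can all be enumerated.

```python
from fractions import Fraction
from itertools import product


def minwise_error(p, X):
    """Exact multiplicative min-wise error on X of the toy family
    h_{a,b}(x) = (a*x + b) mod p, taken over all p*p seeds (a, b).

    Assumes p is prime and X is a set of distinct integers in [0, p).
    """
    X = list(X)
    wins = {y: 0 for y in X}
    for a, b in product(range(p), repeat=2):
        values = {x: (a * x + b) % p for x in X}
        smallest = min(values.values())
        winners = [x for x in X if values[x] == smallest]
        # Count a seed only when the minimum is attained by one element.
        if len(winners) == 1:
            wins[winners[0]] += 1
    total = p * p
    target = Fraction(1, len(X))  # ideal probability 1/|X|
    # Largest relative deviation from 1/|X| over all elements of X.
    return max(abs(Fraction(wins[y], total) / target - 1) for y in X)
```

For example, `minwise_error(5, [0, 1])` evaluates to `Fraction(1, 5)`: the five seeds with $a=0$ map both elements to the same value, so neither is a strict minimum and each element wins with probability $2/5$ instead of $1/2$.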
