Interactive Proofs For Distribution Testing With Conditional Oracles

Ari Biswas, Mark Bun, Clément Canonne, Satchit Sivakumar

TL;DR

The paper tackles the high sample cost in distribution property testing by introducing a polylogarithmic-query, polylogarithmic-sample, sublinear-communication interactive proof framework that uses pairwise conditional (PCond) queries. Central to the approach is the unlabelled bucket histogram as a succinct statistic for label-invariant properties, augmented by learning bucket masses via pairwise comparisons and using neighborhood/moat concepts to certify point probabilities and support sizes. The authors provide both lower bounds showing PCond-query limitations and constructive protocols enabling tolerant verification of label-invariant properties with near-optimal communication, thus achieving exponential savings over plain sampling. Together, these results advance practical verification of distribution properties on large domains and establish a foundation for efficient, conditional-access verified testing. The work also surveys related interactive-proof/conditional-sampling literature and proposes several open directions for tightening complexity and enabling broader composability.

Abstract

We revisit the framework of interactive proofs for distribution testing, first introduced by Chiesa and Gur (ITCS 2018), which has recently experienced a surge in interest, accompanied by notable progress (e.g., Herman and Rothblum, STOC 2022, FOCS 2023; Herman, RANDOM 2024). In this model, a data-poor verifier determines whether a probability distribution has a property of interest by interacting with an all-powerful, data-rich but untrusted prover bent on convincing them that it has the property. While prior work gave sample-, time-, and communication-efficient protocols for testing and estimating a range of distribution properties, they all suffer from an inherent issue: for most interesting properties of distributions over a domain of size $N$, the verifier must draw at least $\Omega(\sqrt{N})$ samples of its own. While sublinear in $N$, this is still prohibitive for large domains encountered in practice. In this work, we circumvent this limitation by augmenting the verifier with the ability to perform an exponentially smaller number of more powerful (but reasonable) pairwise conditional queries, effectively enabling them to perform "local comparison checks" of the prover's claims. We systematically investigate the landscape of interactive proofs in this new setting, giving polylogarithmic query and sample protocols for (tolerantly) testing all label-invariant properties, thus demonstrating exponential savings without compromising on communication, for this large and fundamental class of testing tasks.
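
To make the "local comparison check" concrete, here is a minimal Python sketch of a pairwise conditional query under the standard definition: given two elements $x$ and $y$, the oracle returns $x$ with probability $\mathcal{D}[x] / (\mathcal{D}[x] + \mathcal{D}[y])$ and $y$ otherwise. The names `make_pcond` and `estimate_ratio` are hypothetical, the explicit probability list is a stand-in for the unknown distribution, and the uniform answer when both masses are zero is an assumed convention. This is an illustration only, not the protocol from the paper.

```python
import random

def make_pcond(dist, rng):
    """Build a toy PCond oracle from an explicit probability list (illustration only)."""
    def pcond(x, y):
        px, py = dist[x], dist[y]
        if px + py == 0:
            # Assumed convention: answer uniformly when both elements have zero mass.
            return rng.choice([x, y])
        # Return x with probability D[x] / (D[x] + D[y]), else y.
        return x if rng.random() < px / (px + py) else y
    return pcond

def estimate_ratio(pcond, x, y, queries):
    """Estimate D[x] / D[y] from repeated pairwise conditional queries on {x, y}."""
    hits = sum(1 for _ in range(queries) if pcond(x, y) == x)
    frac = hits / queries
    return float("inf") if frac == 1 else frac / (1 - frac)

rng = random.Random(0)
pcond = make_pcond([0.1, 0.2, 0.7], rng)
ratio = estimate_ratio(pcond, 1, 0, 20000)  # true ratio is 0.2 / 0.1 = 2
```

A verifier could compare such an estimate against a ratio claimed by the prover for the same pair; the number of queries needed depends on how close the two masses are and on the accuracy required.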

Paper Structure

This paper contains 27 sections, 17 theorems, 51 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 1.1

There exists a label-invariant property $\Pi$ such that every tester for $\Pi$ with access to a $\texttt{PCond}$ oracle, proximity parameter $\tau \leq 1/2$, and failure probability $0.01$ must make $\Omega\left(N^{1/3}\right)$ queries.

Figures (9)

  • Figure 1: The approximate histogram $h_\mathcal{D}^{(\tau)}$ induced by an $(N, \tau)$-partitioning of $[N]$ according to $\mathcal{D}$. The blue dots denote domain elements $x \in [N]$ positioned on the interval $[0,1]$ according to their probability mass $\mathcal{D}[x]$. Buckets $(B_1, \ldots, B_L)$ denote a disjoint $\tau$-partitioning of the domain, where $B_i$ is the set of domain elements $x$ such that $\mathcal{D}[x] \in \left(\frac{\tau(1+\tau)^{i-1}}{N},\,\frac{\tau(1+\tau)^{i}}{N}\right]$. The mass of bucket $i$ is denoted by $p_i = \sum_{x \in B_i} \mathcal{D}[x]$.
  • Figure 2: Simulation of $\texttt{PCond}$ Oracle with only access to samples from $\mathcal{D}$
  • Figure 3: Proof System For Support Size Range For Distributions With Large Support
  • Figure 4: First Proof System For Support Size Range For Distributions With Large Support
  • Figure 5: Second Proof System For Support Size Range For Distributions With Large Support
  • ...and 4 more figures
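
The bucketing described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration that assumes the distribution is given explicitly as a list of probabilities; the function names `bucket_index` and `bucket_masses` are hypothetical, and placing elements of mass at most $\tau/N$ in an extra index 0 is an assumption made here, since the caption only defines $B_1, \ldots, B_L$.

```python
import math
from collections import defaultdict

def bucket_index(p, n, tau):
    """Return i >= 1 with tau(1+tau)^(i-1)/n < p <= tau(1+tau)^i/n.

    Elements with p <= tau/n get index 0 (an assumption of this sketch).
    """
    if p <= tau / n:
        return 0
    i = max(1, math.ceil(math.log(p * n / tau) / math.log1p(tau)))
    # Guard against floating-point rounding at bucket boundaries.
    while p > tau * (1 + tau) ** i / n:
        i += 1
    while i > 1 and p <= tau * (1 + tau) ** (i - 1) / n:
        i -= 1
    return i

def bucket_masses(dist, tau):
    """Map each bucket index i to its mass p_i, the sum of D[x] over x in B_i."""
    n = len(dist)
    masses = defaultdict(float)
    for p in dist:
        masses[bucket_index(p, n, tau)] += p
    return dict(masses)

masses = bucket_masses([0.5, 0.25, 0.125, 0.125], tau=0.5)
```

Because the result depends only on the multiset of probabilities and not on which element carries which probability, it is unchanged under relabelling of the domain, which is what makes it a natural statistic for label-invariant properties.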

Theorems & Definitions (48)

  • Theorem 1.1: Informal Version of \ref{cor:lowerbound}
  • Theorem 1.2: Informal Label-Invariant Tolerant Verification Theorem (Theorem \ref{thm:main-thm})
  • Theorem 1.3: Informal Version of \ref{lemma:approximate-single}
  • Definition 2.1: Label-Invariant Properties
  • Definition 2.2: Relabelling Distance
  • Definition 2.3: Histograms Of Distributions
  • Definition 2.4: $(N, \tau)$-Bucketing
  • Definition 2.5: Approximate Histogram of a Distribution
  • Definition 2.6: Earth-Mover Distance and Relative Earth-Mover Distance
  • Lemma 2.7: Relationship Between Re-labelling Distance And Earth Mover Distance
  • ...and 38 more