Optimal Algorithms for Augmented Testing of Discrete Distributions

Maryam Aliakbarpour, Piotr Indyk, Ronitt Rubinfeld, Sandeep Silwal

TL;DR

This work extends distribution-property testing to a setting where a predictor $\hat{p}$ is available, introducing a two-part framework: a search component that adaptively guesses an accuracy level $\alpha$ and an augmented tester that uses $\hat{p}$ and $\alpha$ to perform uniformity, identity, or closeness testing. The main technical contributions are tight upper and lower bounds on sample complexity that depend on the predictor error $d=\|p-\hat{p}\|_{TV}$, including $s=\Theta(\sqrt{n}/\epsilon^2)$ or $\Theta(\min(1/(d-\alpha)^2, \sqrt{n}/\epsilon^2))$ for identity/uniformity and $s=\Theta(n^{2/3}\alpha^{1/3}/\epsilon^{4/3}+\sqrt{n}/\epsilon^2)$ for closeness, with matching lower bounds. The augmented closeness tester utilizes an augmented flattening technique to reduce the $\ell_2^2$-norm and thereby achieve improved sample complexity when the predictor is informative, while remaining robust to poor predictions. Empirical results on synthetic and real data demonstrate significant practical gains (e.g., >20x reductions on hard instances and up to ~40% reductions on network traffic data) and show robustness to prediction quality, validating the approach for real-world deployment in predictive settings.

Abstract

We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution $p$, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor's quality, measured by its total variation distance from $p$. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation's accuracy (i.e. they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e. they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach.
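The abstract measures the predictor's quality by its total variation distance from $p$. As a minimal illustration (not taken from the paper), the sketch below computes that distance for two distributions given as probability lists over the same domain; the function name `total_variation` and the list representation are assumptions made for this example.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Both arguments are sequences of probabilities over the same domain
    (a representation assumed for this sketch, not taken from the paper).
    """
    if len(p) != len(q):
        raise ValueError("distributions must be defined over the same domain")
    # TV distance is half the L1 distance between the probability vectors.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical example: a uniform distribution vs. a point mass on two elements.
uniform = [0.5, 0.5]
point_mass = [1.0, 0.0]
distance = total_variation(uniform, point_mass)
```

A distance near 0 corresponds to an accurate prediction, while a distance near 1 corresponds to a prediction that carries essentially no information about $p$.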

Paper Structure

This paper contains 51 sections, 17 theorems, 80 equations, 7 figures, 1 table, 5 algorithms.

Key Result

Theorem 2

Augmented uniformity and identity testing for distributions over $[n]$, with parameters $\alpha$, $\epsilon$, and $\delta = 2/3$, require $s = \Theta(\sqrt{n}/\epsilon^2)$ or $s = \Theta(\min(1/(d-\alpha)^2, \sqrt{n}/\epsilon^2))$ samples, depending on how $d$ compares with $\alpha$, where $d = \|q - \hat{p}\|_{\text{TV}}$ ($q$ is the known distribution for identity testing, or the uniform distribution for uniformity).
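To give a feel for the two bounds quoted in the TL;DR for identity and uniformity testing, the sketch below evaluates them numerically with all hidden constants set to 1. The function names are hypothetical, and the case split (the standard bound when $d \leq \alpha$, the minimum otherwise) is an assumption of this sketch, chosen because $1/(d-\alpha)^2$ is only well defined when $d > \alpha$. The outputs are orders of magnitude, not actual sample counts.

```python
import math


def standard_bound(n, eps):
    """sqrt(n) / eps^2, the standard bound, with the constant factor dropped."""
    return math.sqrt(n) / eps ** 2


def augmented_identity_bound(n, eps, alpha, d):
    """Order-of-magnitude evaluation of the augmented identity/uniformity bound.

    Assumption of this sketch: the standard bound applies when d <= alpha,
    and min(1/(d - alpha)^2, sqrt(n)/eps^2) applies when d > alpha.
    """
    std = standard_bound(n, eps)
    if d <= alpha:
        return std
    return min(1.0 / (d - alpha) ** 2, std)


# Hypothetical parameters: domain size 10,000 and proximity parameter 0.1.
baseline = standard_bound(10_000, 0.1)
augmented = augmented_identity_bound(10_000, 0.1, alpha=0.1, d=0.5)
```

With these illustrative values, the second bound is far below the standard one because $d$ exceeds $\alpha$ by a wide margin. The `min` guarantees the result never exceeds the standard bound.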

Figures (7)

  • Figure 1: Error vs. sample complexity for the theoretically hard instance (see Sec. \ref{sec:experiments}).
  • Figure 2: A diagram indicating the valid answers for the augmented tester $\mathcal{A}$ based on the total variation distances of $p$ from $q$ and $\hat{p}$, assuming $d \leq \alpha$. The standard tester is required to output accept if $p=q$ (the green dot) and reject if $\|p-q\|_{\text{TV}} \geq \epsilon$ (the red shaded region), with high probability. In addition, the augmented tester may output inaccurate information when $\|p-q\|_{\text{TV}} \geq \epsilon$ and $\|p-\hat{p}\|_{\text{TV}} \geq \alpha$.
  • Figure 3: A diagram indicating the invalid answer for the three distributions $U_n$, ${p^\bullet}$, and ${p^\diamond}$.
  • Figure 4: A visualization of $\hat{p}$, $p^+$, and $p^-$.
  • Figure 5: Error as a function of prediction quality for the 'Hard Instance' dataset
  • ...and 2 more figures

Theorems & Definitions (36)

  • Definition 1.1: Augmented tester
  • Remark 1
  • Theorem 2: Informal version of Theorem \ref{thm:identity_all}
  • Remark 3
  • Theorem 4: Informal version of Theorem \ref{thm:closeness_all}
  • Definition 3.1
  • Theorem 5
  • proof
  • Theorem 6
  • Remark 7
  • ...and 26 more