Testing with Non-identically Distributed Samples

Shivam Garg; Chirag Pabbaraju; Kirankumar Shiragur; Gregory Valiant

TL;DR

This work studies property testing and estimation when samples come from a collection of heterogeneous distributions and the target is a property of their average, $\mathbf p_{avg}$. It shows that learning $\mathbf p_{avg}$ with $c=1$ sample per distribution matches the i.i.d. sample complexity, but sublinear testing is impossible in this regime. Once $c\ge 2$, sublinear guarantees akin to the i.i.d. setting emerge: uniformity and identity testing are possible with $O(\sqrt{k}/\varepsilon^2+1/\varepsilon^4)$ samples, and closeness testing attains near-i.i.d. benchmarks in certain regimes. The paper introduces collision-based estimators adapted to non-identical samples, proves variance bounds, and establishes a lower bound for pooling-based (label-ignoring) estimators, which highlights the importance of preserving per-distribution origin information. It also connects these results to Poissonization and extends the techniques to closeness testing. The findings have practical implications for federated, temporal, and spatial data where heterogeneity is intrinsic, and they point to open directions: tightening the dependence on $\varepsilon$ and extending the framework to additional properties.
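One way to see why two samples per distribution and the origin labels matter is through the squared $\ell_2$ norm of the average, the quantity that collision statistics estimate. The identities below are standard facts, offered as a sketch of the reasoning and not necessarily the exact statistic analyzed in the paper. The symbols $\mathbf u$ (the uniform distribution on $k$ elements) and $X_s, Y_t$ (independent draws from $\mathbf p_s$ and $\mathbf p_t$) are notation assumed here for illustration.

```latex
\begin{align*}
\|\mathbf p_{avg}\|_2^2
  &= \frac{1}{T^2}\sum_{t=1}^{T}\|\mathbf p_t\|_2^2
   + \frac{1}{T^2}\sum_{s\neq t}\langle \mathbf p_s,\mathbf p_t\rangle, \\
\Pr[X_s = Y_t] &= \langle \mathbf p_s,\mathbf p_t\rangle
  \qquad (X_s\sim\mathbf p_s,\ Y_t\sim\mathbf p_t \text{ independent}), \\
\|\mathbf p_{avg}-\mathbf u\|_2^2 &= \|\mathbf p_{avg}\|_2^2-\frac{1}{k}, \\
\|\mathbf p_{avg}-\mathbf u\|_1\ge\varepsilon
  &\implies \|\mathbf p_{avg}-\mathbf u\|_2^2\ge\frac{\varepsilon^2}{k}
  \qquad \text{(Cauchy--Schwarz)}.
\end{align*}
```

The cross terms with $s\neq t$ can be estimated from a single draw per distribution, but each diagonal term $\|\mathbf p_t\|_2^2=\Pr[X_t=Y_t]$ requires two independent draws from the same $\mathbf p_t$. With $c=1$ there is no such pair, and a statistic that discards the labels cannot tell diagonal collisions from cross collisions.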

Abstract

We examine the extent to which sublinear-sample property testing and estimation apply to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{avg}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{avg}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity -- distinguishing the case that $p_{avg}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We also extend our techniques to the problem of testing "closeness" of two distributions.
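The $c=2$ setting can be illustrated with a small collision-style estimator. The sketch below is a minimal illustration under stated assumptions, not the paper's tester: the function names `estimate_avg_sq_norm` and `uniformity_statistic`, the `(T, 2)` sample layout, and the Dirichlet demo instance are all hypothetical choices made here. It counts matches between the first draw of distribution $s$ and the second draw of distribution $t$ over all ordered pairs $(s,t)$, which gives an unbiased estimate of $\|p_{avg}\|_2^2$, and then subtracts $1/k$ to estimate the squared $\ell_2$ distance from uniform.

```python
import numpy as np


def estimate_avg_sq_norm(samples, k):
    """Unbiased estimate of ||p_avg||_2^2 from a (T, 2) array of samples.

    Row t holds two independent draws (X_t, Y_t) from p_t, with values in
    {0, ..., k-1}. The estimate is the number of ordered pairs (s, t),
    including s == t, with X_s == Y_t, divided by T^2.
    """
    samples = np.asarray(samples)
    T = samples.shape[0]
    # Histogram of first draws and of second draws over the domain.
    first = np.bincount(samples[:, 0], minlength=k)
    second = np.bincount(samples[:, 1], minlength=k)
    # sum_j first[j] * second[j] = #{(s, t) : X_s == Y_t}.
    collisions = int(np.dot(first, second))
    return collisions / T**2


def uniformity_statistic(samples, k):
    """Estimate of ||p_avg - u||_2^2, using ||p - u||_2^2 = ||p||_2^2 - 1/k."""
    return estimate_avg_sq_norm(samples, k) - 1.0 / k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, T = 50, 2000
    # Hypothetical heterogeneous instance: every p_t is a different
    # Dirichlet-random distribution over the k elements.
    ps = rng.dirichlet(np.ones(k), size=T)
    samples = np.array([rng.choice(k, size=2, p=p) for p in ps])
    p_avg = ps.mean(axis=0)
    print("true  ||p_avg||^2:", float(np.sum(p_avg**2)))
    print("estimated        :", estimate_avg_sq_norm(samples, k))
```

Keeping the first and second draw of each distribution in separate columns is what preserves the origin information: the pair $(X_t, Y_t)$ contributes an unbiased estimate of $\|p_t\|_2^2$, which is unavailable when $c=1$.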

Paper Structure (33 sections, 34 theorems, 175 equations, 1 figure)

Introduction
Summary of Results
Future Directions
Related Work
Beyond i.i.d. samples:
Uniformity testing from non-identical samples
Uniform case:
Far from uniform case:
Identity testing from non-identical samples
Claim 1:
Claim 2:
Sampling:
Lower bound for pooling-based estimators
Step 1:
Step 2:
...and 18 more sections

Key Result

Theorem 1.1

There is an absolute constant $\alpha$ such that, given access to $T$ distributions $\mathbf{p}_1,\ldots,\mathbf{p}_T$, each supported on a common domain of size $\le k$, for any $\varepsilon>0$, provided $T \ge \alpha(\sqrt{k}/\varepsilon^2+1/\varepsilon^4)$ and given $c=2$ samples drawn fr...
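The two terms in the bound $\sqrt{k}/\varepsilon^2+1/\varepsilon^4$ trade places at $\varepsilon = k^{-1/4}$, which is why the result matches the i.i.d. rate only above that threshold. The short calculation below spells this out. The numerical values of $k$ and $\varepsilon$ are chosen here purely for illustration, and the constant $\alpha$ is left unspecified, as in the statement.

```latex
\begin{align*}
\varepsilon \ge k^{-1/4}
  &\iff \frac{1}{\varepsilon^{2}} \le \sqrt{k}
   \iff \frac{1}{\varepsilon^{4}} \le \frac{\sqrt{k}}{\varepsilon^{2}}, \\
\text{so for } \varepsilon \ge k^{-1/4}: \quad
\frac{\sqrt{k}}{\varepsilon^{2}} + \frac{1}{\varepsilon^{4}}
  &\le \frac{2\sqrt{k}}{\varepsilon^{2}}
   = O\!\left(\frac{\sqrt{k}}{\varepsilon^{2}}\right). \\
\text{Example } (k = 10^{8},\ \varepsilon = 0.1): \quad
k^{-1/4} &= 10^{-2} \le \varepsilon, \qquad
\frac{\sqrt{k}}{\varepsilon^{2}} = 10^{6}, \qquad
\frac{1}{\varepsilon^{4}} = 10^{4}.
\end{align*}
```

In this example the bound is about $1.01\times 10^{6}$ up to the factor $\alpha$, far below the domain size $k=10^{8}$. For $\varepsilon < k^{-1/4}$ the $1/\varepsilon^{4}$ term dominates instead.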

Figures (1)

Figure 1: Building block for the hard instance defined in Definition \ref{def:structure_a_b}.

Theorems & Definitions (84)

Example 1
Example 2
Example 3
Claim 1.0: Learning the distribution, proof in \ref{sec:learning}
Claim 1.0: Impossibility result with $c=1$, proof in \ref{sec:impossibility_cone_proof}
Theorem 1.1: Uniformity testing, proof in \ref{sec:uni}
Lemma 1.1: Identity to uniformity testing, proof in \ref{sec:identouni}
Corollary 1.1: Identity testing
Theorem 1.2: Lower bound for "pooled" estimators, proof in \ref{sec:lb_pool}
Theorem 1.3: Closeness testing, proof in Appendix \ref{section:closeness}
...and 74 more
