Testing Distributions of Huge Objects
Oded Goldreich, Dana Ron
TL;DR
This work introduces the Testing Distributions of Huge Objects (DoHO) model, which combines distribution testing with property testing of very long objects: the tester obtains samples from a distribution over $n$-bit strings and queries each sample at selected coordinates rather than reading it in full. Distance between distributions is defined as the earth mover's distance with respect to the relative Hamming distance between strings, which allows query complexity sublinear in the object size while keeping a meaningful notion of proximity. The paper gives general bounds relating DoHO query complexity to sample complexity in the standard distribution testing model, develops testers for natural properties such as support size, uniformity, and $m$-granularity, and extends the model to tuples of distributions and to distributions that arise as perturbations, random cyclic shifts, and random isomorphic copies of graphs. Additional contributions include testers for equality of distributions in the DoHO setting and a framework for self-correctable testable subsets. Overall, the DoHO model offers tools for analyzing distributions over huge objects with sublinear probing, with potential relevance to genetics, large-scale data, and graph-related problems where reading entire objects is impractical.
Abstract
We initiate a study of a new model of property testing that is a hybrid of testing properties of distributions and testing properties of strings. Specifically, the new model refers to testing properties of distributions, but these are distributions over huge objects (i.e., very long strings). Accordingly, the model accounts for the total number of local probes into these objects (resp., queries to the strings) as well as for the distance between objects (resp., strings), and the distance between distributions is defined as the earth mover's distance with respect to the relative Hamming distance between strings. We study the query complexity of testing in this new model, focusing on three directions. First, we try to relate the query complexity of testing properties in the new model to the sample complexity of testing these properties in the standard distribution testing model. Second, we consider the complexity of testing properties that arise naturally in the new model (e.g., distributions that capture random variations of fixed strings). Third, we consider the complexity of testing properties that were extensively studied in the standard distribution testing model: two such cases are uniform distributions and pairs of identical distributions.
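To make the distance notion concrete, the following is a minimal illustrative sketch, not taken from the paper. It assumes the special case of two distributions that are each uniform over an equal-size multiset of strings. In that case the earth mover's distance reduces to a minimum-cost perfect matching, which the sketch finds by brute force, so it is only suitable for tiny inputs. The function names `relative_hamming`, `emd_uniform_multisets`, and `total_variation` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def relative_hamming(x: str, y: str) -> float:
    """Fraction of coordinates on which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def emd_uniform_multisets(xs: list[str], ys: list[str]) -> float:
    """Earth mover's distance, w.r.t. relative Hamming distance, between the
    uniform distributions over two equal-size multisets of strings.

    Assumption: with uniform weights 1/k on both sides, an optimal transport
    plan can be taken to be a perfect matching, so we minimize over all
    permutations (brute force, exponential in k).
    """
    if len(xs) != len(ys):
        raise ValueError("multisets must have equal size")
    k = len(xs)
    best = min(
        sum(relative_hamming(x, y) for x, y in zip(xs, perm))
        for perm in permutations(ys)
    )
    return best / k


def total_variation(xs: list[str], ys: list[str]) -> float:
    """Total variation distance between the same two uniform distributions."""
    cx, cy = Counter(xs), Counter(ys)
    support = set(cx) | set(cy)
    return 0.5 * sum(abs(cx[s] / len(xs) - cy[s] / len(ys)) for s in support)


# Two distributions with disjoint supports whose strings are pairwise close.
P = ["0000", "1111"]
Q = ["0001", "1110"]
emd = emd_uniform_multisets(P, Q)
tv = total_variation(P, Q)
```

In this example the supports are disjoint, so the total variation distance is maximal, while the earth mover's distance is small because each string in one support differs from some string in the other in only one of four positions. This is the sense in which the model's distance accounts for the distance between objects as well as between distributions.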
