Replicability in High Dimensional Statistics

Max Hopkins, Russell Impagliazzo, Daniel Kane, Sihan Liu, Christopher Ye

TL;DR

This work establishes a principled link between replicable statistics in high dimensions and low-surface-area isoperimetric tilings of space, revealing a computational and statistical equivalence that governs the cost of replication for mean estimation and multi-hypothesis testing. It proves that replicable mean estimation and the $N$-Coin Problem are essentially as hard as constructing approximate isoperimetric tilings, yielding sample complexities that are near-tight (up to polylogarithmic factors) in the dimension $N$, and tight linear-overhead bounds in several regimes. Because efficient isoperimetric tilings are not known, the authors introduce three relaxed frameworks (pre-processing, adaptivity, and approximate replicability) that enable polynomial-time algorithms matching or beating the best known sample complexity for mean estimation and coin-type tasks. They also develop a rounding scheme that converts tilings into replicable mean estimators and demonstrate a two-way correspondence: replicable algorithms induce tilings, and tilings yield replicable learning, with implications for the $N$-Coin Problem and adaptive strategies. Beyond foundational results, the paper discusses lattice-based tilings, the closest-vector problem, and adaptive compositions that yield scalable replicable procedures for high-dimensional statistical tasks. Overall, the work frames replicability as a geometry-informed computational problem, with potential impact on robust scientific inference and large-scale multiple testing.

Abstract

The replicability crisis is a major issue across nearly all areas of empirical science, calling for the formal study of replicability in statistics. Motivated in this context, [Impagliazzo, Lei, Pitassi, and Sorrell, STOC 2022] introduced the notion of replicable learning algorithms, and gave basic procedures for $1$-dimensional tasks including statistical queries. In this work, we study the computational and statistical cost of replicability for several fundamental high dimensional statistical tasks, including multi-hypothesis testing and mean estimation. Our main contribution establishes a computational and statistical equivalence between optimal replicable algorithms and high dimensional isoperimetric tilings. As a consequence, we obtain matching sample complexity upper and lower bounds for replicable mean estimation of distributions with bounded covariance, resolving an open problem of [Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell, STOC 2023], and for the $N$-Coin Problem, resolving a problem of [Karbasi, Velegkas, Yang, and Zhou, NeurIPS 2023] up to log factors. While our equivalence is computational, allowing us to shave log factors in sample complexity from the best known efficient algorithms, efficient isoperimetric tilings are not known. To circumvent this, we introduce several relaxed paradigms that do allow for sample and computationally efficient algorithms, including allowing pre-processing, adaptivity, and approximate replicability. In these cases we give efficient algorithms matching or beating the best known sample complexity for mean estimation and the coin problem, including a generic procedure that reduces the standard quadratic overhead of replicability to linear in expectation.
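To make the link between tilings and replicable mean estimation concrete, the following is a minimal sketch of the simplest rounding idea: two runs share a random seed, use it to draw the same random shift of a cubic grid, and each rounds its own empirical mean to the center of the grid cell containing it. When the two empirical means are close relative to the cell width, they usually land in the same cell and the outputs coincide exactly. The function name `replicable_mean`, the parameters `cell_width` and `shared_seed`, and the choice of a cubic grid are illustrative assumptions, not the paper's construction; the paper's concern is tilings with lower surface area than cubes, which this sketch does not attempt to build.

```python
import numpy as np


def replicable_mean(samples, cell_width, shared_seed):
    """Round the empirical mean to the center of a randomly shifted cube.

    Illustrative sketch only: a cubic grid stands in for the tiling, and
    `shared_seed` models the randomness shared between the two runs.
    """
    samples = np.asarray(samples, dtype=float)
    dim = samples.shape[1]
    # Shared randomness: every run with the same seed draws the same offset.
    rng = np.random.default_rng(shared_seed)
    offset = rng.uniform(0.0, cell_width, size=dim)
    empirical = samples.mean(axis=0)
    # Integer index of the grid cell that contains the empirical mean.
    cell = np.floor((empirical - offset) / cell_width)
    # Output the center of that cell; this depends on the data only
    # through which cell the empirical mean falls into.
    return offset + (cell + 0.5) * cell_width


# Two independent sample sets from the same distribution, same shared seed.
data_rng = np.random.default_rng(0)
run_a = data_rng.normal(loc=1.0, size=(20000, 3))
run_b = data_rng.normal(loc=1.0, size=(20000, 3))
print(replicable_mean(run_a, cell_width=0.5, shared_seed=42))
print(replicable_mean(run_b, cell_width=0.5, shared_seed=42))
```

The two runs disagree only when a cell boundary separates their empirical means, so the chance of disagreement is governed by how much boundary the tiling has near a typical point. That is the intuition for why tilings with small surface area matter here; a wider cell lowers the disagreement probability at the price of a coarser estimate.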

Paper Structure

This paper contains 102 sections, 113 theorems, 349 equations, 12 algorithms.

Key Result

Theorem 1.2

Let $p_0, q_0 \in (0,1/2)$ and $\rho \in (0,1)$. There is a computationally efficient $\rho$-replicable algorithm for the $(p_0,q_0)$-coin problem using samples. Conversely, any algorithm for the $(p_0,q_0)$-coin problem uses at least samples in the worst case.
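As a hedged illustration of how a replicable coin test can work, the sketch below uses the random-threshold idea: both runs draw the same threshold between $p_0$ and $q_0$ from shared randomness and compare their own empirical bias against it. It assumes the usual formulation of the $(p_0,q_0)$-coin problem, in which the coin's bias is either at most $p_0$ or at least $q_0$ with $p_0 < q_0$, which is not spelled out in the text above. The name `replicable_coin_test` and its parameters are hypothetical, and the sketch says nothing about how many flips are needed.

```python
import numpy as np


def replicable_coin_test(flips, p0, q0, shared_seed):
    """Return True if the coin is judged to have bias at least q0.

    Illustrative sketch only: `shared_seed` models the randomness shared
    between runs, and `flips` is a sequence of 0/1 outcomes.
    """
    # Shared randomness: the same seed yields the same threshold in [p0, q0).
    rng = np.random.default_rng(shared_seed)
    threshold = rng.uniform(p0, q0)
    # Each run compares its own empirical bias to the common threshold.
    empirical_bias = float(np.mean(flips))
    return empirical_bias >= threshold


# Two independent runs on the same coin, sharing one seed.
coin_rng = np.random.default_rng(0)
flips_a = coin_rng.binomial(1, 0.4, size=5000)
flips_b = coin_rng.binomial(1, 0.4, size=5000)
print(replicable_coin_test(flips_a, p0=0.2, q0=0.4, shared_seed=3))
print(replicable_coin_test(flips_b, p0=0.2, q0=0.4, shared_seed=3))
```

Two runs can give different answers only if the threshold falls between their two empirical biases. With a uniformly random threshold, that happens with probability proportional to the gap between the empirical biases divided by $q_0 - p_0$, which is why more samples buy more replicability.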

Theorems & Definitions (225)

  • Definition 1.1: [Impagliazzo, Lei, Pitassi, and Sorrell, STOC 2022]
  • Definition 1.1: Hypothesis Testing
  • Theorem 1.2: [Impagliazzo, Lei, Pitassi, and Sorrell, STOC 2022; Karbasi, Velegkas, Yang, and Zhou, NeurIPS 2023]
  • Theorem 1.3: Informal \ref{thm:r-adapt-coin-problem} and \ref{thm:q0-num-lower-bound}
  • Definition 1.4: Isoperimetric Approximate Tilings (Informal \ref{def:approximate-tiling})
  • Theorem 1.5: Replicability $\iff$ Isoperimetry (Informal \ref{thm:tiling-to-replicable} and \ref{thm:replicable-non-uniform-tiling-formal})
  • Corollary 1.6: Replicable $\ell_2$ Mean Estimation (Informal \ref{thm:replicable-alg-partition-lb} and \ref{cor:ell-2-mean-estimation})
  • Corollary 1.7: Efficient Mean Estimation in Sub-Cubic Samples (Informal \ref{cor:sub-cubic})
  • Corollary 1.8: $\ell_p$-norm Replicability $\iff$ Tilings
  • Theorem 1.9: Replicable $\ell_\infty$-Mean-Estimation (Informal \ref{thm:ell-infty-mean} and \ref{thm:n-coin-lower-const-delta})
  • ...and 215 more