Efficient Adaptive Data Analysis over Dense Distributions

Joon Suk Huh

TL;DR

This paper addresses the tension between computational efficiency and sample complexity in adaptive data analysis (ADA) by identifying a natural class of data distributions: those that are dense relative to a known prior. It introduces a computationally efficient ADA mechanism, AQA-OGD, that answers $T$ adaptively chosen queries with $O(\lambda^{-2} \epsilon^{-4} \log T)$ samples, independent of the ambient dimension, where $\lambda$ is the density parameter and $\epsilon$ is the target accuracy. The same mechanism yields a distribution-specific statistical query (DSQ) oracle using $O(\epsilon^{-4}\log(T/\epsilon))$ samples, and it satisfies Predicate Singling Out (PSO) security rather than differential privacy (DP), linking adaptive data analysis to privacy notions beyond DP. Technically, the mechanism replaces the costly data sketches of PMW-like mechanisms with a lazily updated online gradient-descent sketch; because the resulting interaction transcript is succinct, uniform convergence follows from Hoeffding bounds. Overall, the work extends efficient ADA to distribution-specific settings, with theoretical guarantees on both accuracy and privacy risk.
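As a rough illustration of the lazy online gradient-descent sketch idea, the toy Python code below keeps a density `w` relative to a uniform prior, answers a query from `w` whenever that answer is close to the empirical one, and takes a projected gradient step otherwise. It is not the paper's AQA-OGD algorithm. It assumes a small, explicitly enumerated domain, which the paper's dimension-independent mechanism would not need, and the names `project_dense` and `LazyOGDSketch`, the step size `eps / 2`, and the laziness threshold `eps` are all hypothetical choices made for this illustration.

```python
import numpy as np


def project_dense(w, cap, iters=60):
    """Euclidean projection onto {w : 0 <= w_i <= cap, mean(w) = 1}.

    The projection has the form clip(w - tau, 0, cap); tau is found by bisection,
    using the fact that the mean of the clipped vector is non-increasing in tau.
    """
    lo, hi = w.min() - cap, w.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(w - tau, 0.0, cap).mean() > 1.0:
            lo = tau  # mean too large: shift further down
        else:
            hi = tau
    return np.clip(w - 0.5 * (lo + hi), 0.0, cap)


class LazyOGDSketch:
    """Toy query-answering mechanism over a finite domain with a uniform prior."""

    def __init__(self, sample, domain_size, lam, eps):
        self.sample = np.asarray(sample)   # sampled indices into the toy domain
        self.cap = 1.0 / lam               # lam-dense: density ratio is at most 1/lam
        self.eps = eps                     # laziness threshold (assumed)
        self.eta = eps / 2.0               # step size (assumed)
        self.w = np.ones(domain_size)      # start from the prior itself
        self.updates = 0

    def answer(self, q):
        """Answer a query q, given as a vector of values in [0, 1] over the domain."""
        q = np.asarray(q, dtype=float)
        sketch_ans = float(np.mean(self.w * q))      # expectation under the sketch
        empirical = float(np.mean(q[self.sample]))   # expectation under the sample
        if abs(sketch_ans - empirical) <= self.eps:
            return sketch_ans                        # lazy round: the sketch is unchanged
        # Update round: step against the sign of the error, then project back.
        direction = np.sign(sketch_ans - empirical)
        self.w = project_dense(self.w - self.eta * direction * q, self.cap)
        self.updates += 1
        return empirical
```

In this sketch the data are consulted on every round only to decide whether the sketch is still accurate, and the sketch itself changes only on update rounds. A faithful mechanism would also control what the returned answers reveal, for example by rounding them; that step is omitted here.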

Abstract

Modern data workflows are inherently adaptive, repeatedly querying the same dataset to refine and validate sequential decisions, but such adaptivity can lead to overfitting and invalid statistical inference. Adaptive Data Analysis (ADA) mechanisms address this challenge; however, there is a fundamental tension between computational efficiency and sample complexity. For $T$ rounds of adaptive analysis, computationally efficient algorithms typically incur suboptimal $O(\sqrt{T})$ sample complexity, whereas statistically optimal $O(\log T)$ algorithms are computationally intractable under standard cryptographic assumptions. In this work, we shed light on this trade-off by identifying a natural class of data distributions under which both computational efficiency and optimal sample complexity are achievable. We propose a computationally efficient ADA mechanism that attains optimal $O(\log T)$ sample complexity when the data distribution is dense with respect to a known prior. This setting includes, in particular, feature--label data distributions arising in distribution-specific learning. As a consequence, our mechanism also yields a sample-efficient (i.e., $O(\log T)$ samples) statistical query oracle in the distribution-specific setting. Moreover, although our algorithm is not based on differential privacy, it satisfies a relaxed privacy notion known as Predicate Singling Out (PSO) security (Cohen and Nissim, 2020). Our results thus reveal an inherent connection between adaptive data analysis and privacy beyond differential privacy.
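To make the $O(\log T)$ benchmark concrete, the following derivation sketches the standard Hoeffding and union-bound argument for queries $q_t$ with values in $[0,1]$. It is a textbook calculation under stated assumptions, not the paper's proof; the failure probability $\delta$ and the transcript length $b$ are symbols introduced here for illustration.

```latex
% Non-adaptive baseline: queries q_1, ..., q_T are fixed before the n i.i.d. samples are drawn.
\[
\Pr\left[\,\left|\frac{1}{n}\sum_{i=1}^{n} q_t(x_i) - \mathbb{E}_{x\sim\mathcal{D}}\,[q_t(x)]\right| > \epsilon\,\right]
\le 2e^{-2n\epsilon^{2}}
\quad\Longrightarrow\quad
n \ge \frac{\ln(2T/\delta)}{2\epsilon^{2}}
\ \text{ suffices for all } T \text{ queries with probability at least } 1-\delta .
\]

% Adaptive case via a succinct transcript: if each query is determined by a transcript
% of at most b bits, there are at most T * 2^b candidate queries to union-bound over.
\[
n \ge \frac{b\ln 2 + \ln(2T/\delta)}{2\epsilon^{2}} .
\]
```

For example, with $\epsilon = 0.1$, $\delta = 0.05$, and $T = 1000$ non-adaptive queries, the first bound gives $n \ge \ln(40000)/0.02 \approx 530$ samples. The second bound shows why a short transcript matters: the sample cost grows linearly in the number of bits $b$ that the analyst's choices can depend on.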

Paper Structure (32 sections, 12 theorems, 28 equations)

Introduction
Main Results
Efficient ADA Mechanism for Dense Distributions.
Sample-efficient Distribution-Specific Statistical Query Oracle.
Privacy Guarantee.
Technical Overview
Related Work
Organization
Preliminaries
Notation.
Problem Setting: Adaptive Data Analysis.
Dense Data Distribution.
Working in the Seed Space Without Loss of Generality.
Application: Implementing Distribution-Specific Statistical Query Oracle
ADA Mechanism via Online Gradient Descent
...and 17 more sections

Key Result

theorem 1

Let $g:\{0,1\}^n \rightarrow \mathcal{I}$ be an efficiently computable function. For any distribution $\mathcal{D}$ over $\mathcal{I}$ that is $\lambda$-dense with respect to the distribution $\mathcal{D}_g$ generated by $g$, there exists an ADA mechanism that answers $\mathop{\mathrm{\mathbb{E}}}\l

Theorems & Definitions (22)

theorem 1: Restatement of Theorems \ref{thm:AQALS-CC} and \ref{thm:AQALS-SC}
definition 1: Accuracy
definition 2: Pseudodensity
definition 3: Distribution-specific SQ oracle (Kearns, 1998)
theorem 2
theorem 3
theorem 4
definition 4: Row isolation
theorem 5
theorem 5
...and 12 more
