Efficient Adaptive Data Analysis over Dense Distributions
Joon Suk Huh
TL;DR
This paper addresses the tension between computational efficiency and statistical efficiency in adaptive data analysis (ADA) by identifying a natural class of data distributions: those that are dense relative to a known prior. It introduces a computationally efficient ADA mechanism, AQA-OGD, that answers $T$ adaptively chosen queries with $O(\lambda^{-2} \epsilon^{-4} \log T)$ samples, independent of the ambient dimension. The same mechanism yields a distribution-specific statistical query (DSQ) oracle that uses $O(\epsilon^{-4}\log(T/\epsilon))$ samples, and it satisfies Predicate Singling Out (PSO) security rather than differential privacy (DP), linking adaptive data analysis to privacy notions beyond DP. Technically, the approach replaces costly PMW-like data sketches with a lazy online gradient descent sketch whose succinct transcript enables uniform convergence via Hoeffding bounds, and both the DSQ and privacy properties follow within this single framework. Overall, the work extends efficient ADA to practical, distribution-specific settings with theoretical guarantees on both accuracy and privacy risk.
Abstract
Modern data workflows are inherently adaptive, repeatedly querying the same dataset to refine and validate sequential decisions, but such adaptivity can lead to overfitting and invalid statistical inference. Adaptive Data Analysis (ADA) mechanisms address this challenge; however, there is a fundamental tension between computational efficiency and sample complexity. For $T$ rounds of adaptive analysis, computationally efficient algorithms typically incur suboptimal $O(\sqrt{T})$ sample complexity, whereas statistically optimal $O(\log T)$ algorithms are computationally intractable under standard cryptographic assumptions. In this work, we shed light on this trade-off by identifying a natural class of data distributions under which both computational efficiency and optimal sample complexity are achievable. We propose a computationally efficient ADA mechanism that attains optimal $O(\log T)$ sample complexity when the data distribution is dense with respect to a known prior. This setting includes, in particular, feature-label data distributions arising in distribution-specific learning. As a consequence, our mechanism also yields a sample-efficient (i.e., $O(\log T)$ samples) statistical query oracle in the distribution-specific setting. Moreover, although our algorithm is not based on differential privacy, it satisfies a relaxed privacy notion known as Predicate Singling Out (PSO) security (Cohen and Nissim, 2020). Our results thus reveal an inherent connection between adaptive data analysis and privacy beyond differential privacy.
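To make the idea of a lazy sketch over a dense distribution concrete, the following is a minimal toy illustration in Python. It is not the paper's AQA-OGD mechanism. It assumes a small finite domain that can be enumerated (which an efficient mechanism would avoid), a uniform prior, predicate (0/1) queries, and a simplified projection that clips the density ratio to $[0, 1/\lambda]$ and normalises when answering. The class name `LazyOGDSketch`, the step size `eta`, and the tolerance `threshold` are hypothetical choices made only for this example, and the demo queries are random rather than adaptive.

```python
import numpy as np


class LazyOGDSketch:
    """Toy lazy online-gradient-descent sketch over a finite domain (illustrative only)."""

    def __init__(self, prior, lam, eta, threshold):
        self.prior = np.asarray(prior, dtype=float)  # known prior P over the domain
        self.cap = 1.0 / lam          # density assumption: D(x) <= P(x) / lam
        self.eta = eta                # hypothetical step size
        self.threshold = threshold    # hypothetical tolerance before an update fires
        self.w = np.ones_like(self.prior)  # density ratio estimate, initialised at the prior
        self.updates = 0

    def sketch_answer(self, q):
        """Expectation of q under the normalised distribution proportional to P * w."""
        mass = self.prior * self.w
        total = mass.sum()
        if total <= 0.0:              # degenerate sketch: fall back to the prior
            mass, total = self.prior, self.prior.sum()
        return float(mass @ q / total)

    def answer(self, q, sample_idx):
        """Answer one query; update the sketch only when it disagrees with the sample."""
        q = np.asarray(q, dtype=float)
        empirical = float(q[sample_idx].mean())
        guess = self.sketch_answer(q)
        if abs(guess - empirical) <= self.threshold:
            return guess              # lazy round: the sketch is close enough, no update
        direction = np.sign(empirical - guess)
        # Gradient step along the query, then clip back onto the box [0, 1/lam].
        self.w = np.clip(self.w + self.eta * direction * q, 0.0, self.cap)
        self.updates += 1
        return empirical              # simplification: a real mechanism would round or perturb this


def run_demo(seed=0, n_domain=1000, n_samples=2000, n_queries=50):
    rng = np.random.default_rng(seed)
    prior = np.full(n_domain, 1.0 / n_domain)
    # True distribution: uniform on the first half of the domain, hence 0.5-dense in the prior.
    sample_idx = rng.integers(0, n_domain // 2, size=n_samples)
    sketch = LazyOGDSketch(prior, lam=0.5, eta=0.5, threshold=0.1)
    answers, empiricals = [], []
    for _ in range(n_queries):
        q = rng.integers(0, 2, size=n_domain).astype(float)  # random predicate query
        answers.append(sketch.answer(q, sample_idx))
        empiricals.append(float(q[sample_idx].mean()))
    return sketch, np.array(answers), np.array(empiricals)


sketch, answers, empiricals = run_demo()
print(f"updates: {sketch.updates} of {len(answers)} queries")
```

In this toy version the sample is consulted on every round to decide whether to update, but the sketch itself changes only on the rounds where it is off by more than `threshold`. Keeping the number of such update rounds small is what would make the transcript succinct in a lazy design of this kind.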
