Generalization in the Face of Adaptivity: A Bayesian Perspective

Moshe Shenfeld, Katrina Ligett

TL;DR

The paper tackles overfitting in adaptive data analysis, where repeatedly querying a data sample can make empirical evaluations deviate from their values under the underlying distribution. It introduces a variance-aware analysis of the Gaussian mechanism, backed by a novel pairwise concentration (PC) stability framework and a Bayes-stability characterization that connect the harm of adaptivity to a covariance between future queries and a Bayes factor. By proving PC-based composition theorems and deriving variance-dependent generalization bounds, the authors obtain distribution-accuracy guarantees that scale with the query variance rather than the full range, and that extend to unbounded but sub-Gaussian queries. This yields significantly tighter guarantees than worst-case DP analyses and provides a practical, data-dependent stability toolkit for adaptive data analysis with simple noise-addition mechanisms. Overall, the work advances understanding of how typical-case information leakage, captured by Bayes factors and PC, governs adaptive generalization. The required sample size $n$ can scale with the query variance $\sigma^{2}$ and the number of queries $k$, and the Gaussian mechanism with carefully tuned noise achieves distribution accuracy under adaptivity without resorting to worst-case range-based noise calibration.

Abstract

Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting, wherein the empirical evaluation of queries on the sample significantly deviates from their mean with respect to the underlying data distribution. It turns out that simple noise addition algorithms suffice to prevent this issue, and differential privacy-based analysis of these algorithms shows that they can handle an asymptotically optimal number of queries. However, differential privacy's worst-case nature entails scaling such noise to the range of the queries even for highly-concentrated queries, or introducing more complex algorithms. In this paper, we prove that straightforward noise-addition algorithms already provide variance-dependent guarantees that also extend to unbounded queries. This improvement stems from a novel characterization that illuminates the core problem of adaptive data analysis. We show that the harm of adaptivity results from the covariance between the new query and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries. We then leverage this characterization to introduce a new data-dependent stability notion that can bound this covariance.
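To make the "simple noise addition" setting concrete, the following is a minimal sketch of a mechanism that answers each query with its empirical mean plus Gaussian noise, while the analyst chooses the second query based on the first response. The names `make_gaussian_mechanism`, `answer`, and `noise_std`, as well as the sample size and noise scale, are illustrative assumptions and are not taken from the paper; in particular, the noise scale here is arbitrary rather than tuned as the paper's theorems would require.

```python
import random
import statistics


def make_gaussian_mechanism(sample, noise_std, rng):
    """Return a function that answers a query with its noisy empirical mean.

    Hypothetical helper for illustration: `noise_std` is the standard
    deviation of the added Gaussian noise (an assumed, untuned value).
    """
    def answer(query):
        # Empirical mean of the query over the sample (a linear query).
        empirical_mean = statistics.fmean(query(x) for x in sample)
        # Add independent Gaussian noise to the empirical value.
        return empirical_mean + rng.gauss(0.0, noise_std)

    return answer


rng = random.Random(0)
# Assumed data distribution for the demo: standard normal, so the true mean is 0.
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
answer = make_gaussian_mechanism(sample, noise_std=0.01, rng=rng)

# Adaptive interaction: the second query depends on the first response.
r1 = answer(lambda x: x)
threshold = r1
r2 = answer(lambda x: 1.0 if x > threshold else 0.0)
```

Here `r1` estimates the mean of the distribution and `r2` estimates the probability of exceeding the previously reported value. Because the second query is built from `r1`, this is the adaptive pattern whose generalization error the paper analyzes.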

Paper Structure

This paper contains 36 sections, 32 theorems, 107 equations, 1 figure.

Key Result

Theorem 1.1

With probability $> 1 - \delta$, the error of the responses produced by a mechanism which only adds Gaussian noise to the empirical values of the queries it receives is bounded by $\epsilon$, even after responding to $k$ adaptively chosen queries, if

Theorems & Definitions (95)

  • Theorem 1.1: Informal versions of main theorems
  • Definition 2.1: Accuracy of a mechanism
  • Definition 2.2: Linear queries
  • Definition 2.3: Gaussian mechanism
  • Lemma 3.1: Sample accuracy implies posterior accuracy
  • Lemma 3.2: Accuracy of Gaussian mechanism
  • Definition 3.3: Bayes stability
  • Theorem 3.4: Generalization
  • Lemma 3.5: Covariance stability
  • proof
  • ...and 85 more
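
The covariance characterization described in the abstract can be sketched as a short identity. The notation here is an assumption for illustration and may differ from the paper's: $D$ is the data distribution, $v$ stands for the responses observed so far, $D(\cdot \mid v)$ is the posterior over a data element given $v$ (assumed absolutely continuous with respect to $D$), and $K(x, v) = D(x \mid v) / D(x)$ is a Bayes factor. Since $\mathbb{E}_{x \sim D}[K(x, v)] = 1$, the shift in a query's expectation caused by conditioning on $v$ is a covariance:

$$
\begin{aligned}
\mathbb{E}_{x \sim D(\cdot \mid v)}[q(x)] - \mathbb{E}_{x \sim D}[q(x)]
  &= \mathbb{E}_{x \sim D}[q(x)\, K(x, v)] - \mathbb{E}_{x \sim D}[q(x)] \cdot \mathbb{E}_{x \sim D}[K(x, v)] \\
  &= \operatorname{Cov}_{x \sim D}\big(q(x),\, K(x, v)\big).
\end{aligned}
$$

Under these assumptions, a query whose values are nearly uncorrelated with the Bayes factor has nearly the same mean under the posterior as under $D$, which is consistent with guarantees that depend on the query's variance rather than its range.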