On the Performance of Empirical Risk Minimization with Smoothed Data

Adam Block, Alexander Rakhlin, Abhishek Shetty

TL;DR

The paper studies Empirical Risk Minimization (ERM) for squared loss in a smoothed online learning setting with an unknown base measure, showing that ERM achieves sublinear cumulative error when data are $\sigma$-smooth and realizable. It introduces a novel combination of tangent-sequence decoupling, a sharp norm comparison bound for dependent data via the Wills functional, and a symmetrization technique to control ERM performance without knowledge of $\mu$. The main result provides a bound $\mathbb{E}[\mathrm{Err}_T] \le \tilde{O}\left( \sigma^{-1} \sqrt{ (1+\nu) T \big( 1+ \log \mathbb{E}_{\mu}[ W_{2T\log(T)/\sigma}(256 \cdot \mathcal{F}) ] \big)} \right)$, with explicit instantiations for parametric classes yielding $\tilde{O}(\sigma^{-1} \sqrt{dT})$, and a matching lower bound showing near-tightness relative to the complexity of $\mathcal{F}$. In addition to proving that ERM remains effective when the base measure is unknown, the work demonstrates an oracle-efficient learning strategy in this regime and highlights a genuine separation between smoothed and iid data via a lower bound. These results have implications for contextual bandits and structured sequential decision problems where the data-generating distribution is smooth but not fully known.

Abstract

In order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likelihood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is unknown to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F)\cdot T} )$, where $\mathrm{comp}(\mathcal F)$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that comprises the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.
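As a rough illustration of the setting described in the abstract, the following Python sketch simulates smoothed, well-specified data and runs ERM with square loss at every round. It is a toy under stated assumptions, not the paper's experiment or algorithm: the base measure $\mu$ is taken to be uniform on $[0,1]^d$, the function class is linear, the adversary draws each point uniformly from a sub-cube of volume $\sigma$ (so its density with respect to $\mu$ is at most $1/\sigma$), and the names `smooth_sample`, `erm_least_squares`, and `run` are hypothetical. Note that the learner never uses $\mu$ or $\sigma$.

```python
import numpy as np


def smooth_sample(rng, sigma, d, center):
    """Draw x uniformly from a sub-cube of [0,1]^d with volume sigma.

    The density of x w.r.t. the uniform base measure on [0,1]^d is 1/sigma
    on the sub-cube and 0 elsewhere, so the draw is sigma-smooth.
    """
    side = sigma ** (1.0 / d)
    # Shift the sub-cube so that it stays inside the unit cube.
    low = np.clip(center - side / 2.0, 0.0, 1.0 - side)
    return low + side * rng.random(d)


def erm_least_squares(X, y):
    """ERM with square loss over linear functions (minimum-norm solution)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def run(T=200, d=3, sigma=0.25, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d) / np.sqrt(d)  # realizable target (assumed linear)
    X = np.zeros((T, d))
    y = np.zeros(T)
    w_hat = np.zeros(d)  # prediction rule before any data is seen
    err = 0.0
    for t in range(T):
        # Adaptive adversary (illustrative): center the sub-cube on the last point.
        center = X[t - 1] if t > 0 else np.full(d, 0.5)
        x = smooth_sample(rng, sigma, d, center)
        # Cumulative squared error of the current ERM against the true function.
        err += float((x @ (w_hat - w_star)) ** 2)
        X[t] = x
        y[t] = x @ w_star + noise * rng.normal()
        # Refit ERM on all data observed so far.
        w_hat = erm_least_squares(X[: t + 1], y[: t + 1])
    return err


if __name__ == "__main__":
    for T in (100, 400, 1600):
        print(T, run(T=T))
```

Running the script for increasing `T` gives a quick empirical feel for how the cumulative error grows in this toy instance; it does not verify the rates stated in the paper.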

Paper Structure (29 sections, 22 theorems, 128 equations)

Key Result

Lemma 1

Let $X_1, \dots, X_T$ be $\sigma$-smooth with respect to $\mu$. Then for all $k \in \mathbb{N}$, there exists a coupling of $X_1, \dots, X_T$ with random variables $\left\{ Z_{t,j} \mid t \in [T], \, j \in [k] \right\}$ such that the $Z_{t,j} \sim \mu$ are independent and there is an event $\mathcal{E}$ ...
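One standard way to obtain a coupling of this flavor is rejection sampling, sketched below for a single time step. This is an illustrative assumption and not necessarily the construction used in the paper, since the lemma statement above is truncated. The function `couple_one_step` and its arguments are hypothetical names: `density_ratio(z)` stands for $(dp_t/d\mu)(z)$, assumed to be bounded by $1/\sigma$, and `sample_mu` draws from the base measure. Each of the $k$ draws is accepted with probability $\sigma \cdot (dp_t/d\mu)(z)$, so an accepted draw has law $p_t$, and all $k$ draws are rejected with probability $(1-\sigma)^k$.

```python
import numpy as np


def couple_one_step(rng, sigma, k, density_ratio, sample_mu):
    """Try to realise X ~ p as one of k iid draws Z_1, ..., Z_k ~ mu.

    density_ratio(z) is (dp/dmu)(z) and is assumed to be at most 1/sigma.
    Returns (x, Z, ok); ok is False when every draw was rejected.
    """
    Z = [sample_mu(rng) for _ in range(k)]
    for z in Z:
        # Accept z with probability sigma * (dp/dmu)(z), which is at most 1.
        if rng.random() < sigma * density_ratio(z):
            return z, Z, True
    return None, Z, False


if __name__ == "__main__":
    sigma, k = 0.5, 20
    rng = np.random.default_rng(0)
    # Toy example: mu = Uniform[0,1] and p = Uniform[0,sigma].
    ratio = lambda z: 1.0 / sigma if z <= sigma else 0.0
    trials = [couple_one_step(rng, sigma, k, ratio, lambda r: r.random())
              for _ in range(1000)]
    failures = sum(1 for _, _, ok in trials if not ok)
    print("failure frequency:", failures / len(trials))
    print("(1 - sigma)^k:", (1 - sigma) ** k)
```

The design point is that the learner-independent draws $Z_{t,j}$ remain iid from $\mu$ whether or not a draw is accepted, because the acceptance decision uses separate randomness.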

Theorems & Definitions (44)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 1
  • Definition 6
  • Theorem 1
  • Remark 1
  • Theorem 2
  • ...and 34 more