On the Performance of Empirical Risk Minimization with Smoothed Data

Adam Block; Alexander Rakhlin; Abhishek Shetty

TL;DR

The paper studies Empirical Risk Minimization (ERM) for squared loss in a smoothed online learning setting with an unknown base measure $\mu$, showing that ERM achieves sublinear cumulative error when data are $\sigma$-smooth and realizable. It introduces a novel combination of tangent-sequence decoupling, a sharp norm comparison bound for dependent data via the Wills functional, and a symmetrization technique to control ERM performance without knowledge of $\mu$. The main result provides a bound $\mathbb{E}[\mathrm{Err}_T] \le \tilde{O}\left( \sigma^{-1} \sqrt{ (1+\nu) T \big( 1+ \log \mathbb{E}_{\mu}[ W_{2T\log(T)/\sigma}(256 \cdot \mathcal{F}) ] \big)} \right)$, with explicit instantiations for parametric classes yielding $\tilde{O}(\sigma^{-1} \sqrt{dT})$, and a matching lower bound showing near-tightness relative to the complexity of $\mathcal{F}$. In addition to proving that ERM remains effective when the base measure is unknown, the work demonstrates an oracle-efficient learning strategy in this regime and highlights a genuine separation between smoothed and iid data via a lower bound. These results have implications for contextual bandits and structured sequential decision problems where the data-generating distribution is smooth but not fully known.
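As a rough illustration of how the parametric rate could follow from the main bound, the sketch below substitutes a logarithmic growth rate for the Wills functional. That growth rate, and the constant $C$, are assumptions made here for illustration and are not stated in this summary; $\nu$ is treated as a constant, and $\tilde{O}$ is taken to hide factors polylogarithmic in $T$ and $1/\sigma$.

```latex
% Assumption (hypothetical, for illustration only): for a d-dimensional
% parametric class there is a constant C with
%   \log \mathbb{E}_{\mu}[ W_m(256 \cdot \mathcal{F}) ] \le C \, d \log(m).
\begin{align*}
m &= \frac{2T\log(T)}{\sigma}, \\
\mathbb{E}[\mathrm{Err}_T]
  &\le \tilde{O}\left( \sigma^{-1} \sqrt{ (1+\nu)\, T \,\big( 1 + C\, d \log(m) \big) } \right) \\
  &= \tilde{O}\left( \sigma^{-1} \sqrt{ d\, T } \right).
\end{align*}
```

The last step only absorbs $\log(2T\log(T)/\sigma)$ and the constants into $\tilde{O}$; no further structure of $\mathcal{F}$ is used.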

Abstract

In order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likelihood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is unknown to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F)\cdot T} )$, where $\mathrm{comp}(\mathcal F)$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that comprises the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.
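The setting can be made concrete with a small simulation. The sketch below is an illustrative toy, not the paper's experiment or algorithm: it assumes a base measure $\mu$ uniform on $[0,1]$, a polynomial feature class, Gaussian noise, and a simple adaptive adversary that draws each point uniformly from an interval of $\mu$-mass $\sigma$ (so its density ratio with respect to $\mu$ is exactly $1/\sigma$). The function name `simulate_erm`, its parameters, and the reading of the per-round error as the squared gap between the ERM prediction and the true regression function are all assumptions of this sketch. Note that the learner never uses $\mu$.

```python
import numpy as np

def simulate_erm(T=500, d=3, sigma=0.1, noise=0.1, seed=0):
    """Toy run of square-loss ERM on sigma-smooth, well-specified data.

    Returns the per-round errors (f_hat_t(X_t) - f_star(X_t))^2.
    All modelling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)  # true parameter (realizable case)

    def features(x):
        # Polynomial features (1, x, ..., x^{d-1}); works for scalars and arrays.
        return np.asarray(x, dtype=float)[..., None] ** np.arange(d)

    # Candidate intervals of length sigma inside [0, 1].
    lefts = np.linspace(0.0, 1.0 - sigma, 20)
    theta_hat = np.zeros(d)  # ERM output before any data is seen
    X, Y = [], []
    errs = np.zeros(T)
    for t in range(T):
        # Adaptive adversary: pick the interval whose midpoint currently has the
        # largest prediction gap. Uniform on an interval of mu-mass sigma has
        # density 1/sigma w.r.t. mu = Unif[0, 1], so the draw is sigma-smooth.
        gaps = np.abs(features(lefts + sigma / 2) @ (theta_hat - theta_star))
        left = lefts[np.argmax(gaps)]
        x = rng.uniform(left, left + sigma)

        phi = features(x)
        errs[t] = float(phi @ (theta_hat - theta_star)) ** 2

        # Well-specified label, then refit ERM (least squares) on all data so far.
        y = phi @ theta_star + noise * rng.normal()
        X.append(phi)
        Y.append(y)
        theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return errs

if __name__ == "__main__":
    errs = simulate_erm()
    print("cumulative error:", errs.sum())
```

Plotting `np.cumsum(errs)` against $t$ for several values of `sigma` is one way to look at how the cumulative error grows in this toy setup; the sketch makes no claim about matching the rates stated above.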

Paper Structure (29 sections, 22 theorems, 128 equations)

Introduction
Notation and Preliminaries
Problem Formulation and Smoothness
Measures of Complexity of a Function Class
Additional Prerequisites
Notation.
Main Results
Analysis Techniques
Proof Sketch of Theorem \ref{thm:main}
Decoupling.
Symmetrization.
Proof Sketch of Theorem \ref{thm:tight_norm_comparison}
Lower Bound for ERM
Related Work
Smoothed Online Learning.
...and 14 more sections

Key Result

Lemma 1

Let $X_1, \dots, X_T$ be $\sigma$-smooth with respect to $\mu$. Then for all $k \in \mathbb{N}$, there exists a coupling of $X_1, \dots, X_T$ with random variables $\left\{ Z_{t,j} | t \in [T], \, j \in [k] \right\}$ such that the $Z_{t,j} \sim \mu$ are independent and there is an event $\mathcal{E}

Theorems & Definitions (44)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
Lemma 1
Definition 6
Theorem 1
Remark 1
Theorem 2
...and 34 more
