Do we really need the Rademacher complexities?

Daniel Bartl, Shahar Mendelson

TL;DR

The paper shows that the sample complexity for convex learning with squared loss is governed by the limiting Gaussian process rather than Rademacher complexities, establishing a universal behavior across problems sharing the same $L_2$ structure, including heavy-tailed settings. It introduces a data-driven learning procedure that uses a crude risk oracle, a fine risk oracle, and a tournament to select a predictor with provable $L_2$-error and excess risk guarantees, even when traditional ERM fails under heavy tails. Central to the approach are unorthodox chaining techniques that blend optimal mean estimation with Talagrand's majorizing measures, along with a distance oracle to connect $L_2$ distances to observable quantities. The results extend to linear regression and more general convex classes, providing fixed-point characterizations through Gaussian-process-based quantities $r_{ Q}$ and $r_{ M}$ and offering a robust alternative to Rademacher-based analyses in practical, heavy-tailed regimes.

Abstract

We study the fundamental problem of learning with respect to the squared loss in a convex class. The state-of-the-art sample complexity estimates in this setting rely on Rademacher complexities, which are generally difficult to control. We prove that, contrary to prevailing belief and under minimal assumptions, the sample complexity is not governed by the Rademacher complexities but rather by the behaviour of the limiting Gaussian process. In particular, all such learning problems that have the same $L_2$-structure -- even those with heavy-tailed distributions -- share the same sample complexity. This constitutes the first universality result for general convex learning problems. The proof is based on a novel learning procedure, and its performance is studied by combining optimal mean estimation techniques for real-valued random variables with Talagrand's generic chaining method.
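To make "optimal mean estimation for real-valued random variables" concrete, here is a minimal sketch of the median-of-means estimator, a standard estimator of this type that stays accurate under heavy tails. This summary does not say which estimator the paper uses, so the choice of median-of-means, the function name `median_of_means`, and the parameter `num_blocks` are assumptions made for illustration only.

```python
import random
import statistics


def median_of_means(samples, num_blocks):
    """Estimate the mean of real-valued samples by median-of-means.

    Illustrative sketch only; not taken from the paper.
    Samples are assumed i.i.d., so contiguous blocks are used as-is.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    # Split into disjoint blocks of equal size; leftover samples are dropped.
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    # The median discards the few blocks contaminated by extreme values.
    return statistics.median(block_means)


# Small deterministic example: one extreme value spoils a single block only.
clean = median_of_means([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], num_blocks=3)
robust = median_of_means([1.0, 1.0, 1.0, 1.0, 1.0, 1000.0], num_blocks=3)

# Heavy-tailed example: Pareto samples with shape 2.5 (true mean 5/3).
rng = random.Random(0)
heavy = [rng.paretovariate(2.5) for _ in range(10_000)]
estimate = median_of_means(heavy, num_blocks=20)
```

In the second deterministic example the block means are 1, 1 and 500.5, so the median returns 1 while the plain average would be 167.5. More blocks buy higher confidence at the cost of noisier block means.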

Paper Structure

This paper contains 17 sections, 17 theorems, 161 equations.

Key Result

Theorem 1.7

There are constants $c,c_0,c_1$ and $c_2$ that depend only on $L$ for which the following holds. Let and fix $r \geq 2r_{\rm rad}^*$. There exists a procedure that, based on the data ${\mathcal{D}}_N=(X_i,Y_i)_{i=1}^N$ and the values of $L$, $\overline{\sigma}$ and $r$, selects a function $\widehat{f} \in F$ which satisfies that with probability at least

Theorems & Definitions (45)

  • Remark 1.1
  • Definition 1.2
  • Definition 1.3
  • Definition 1.4
  • Example 1.6
  • Theorem 1.7
  • Corollary 1.8
  • Remark 1.9
  • Example 1.10: Linear regression in $\mathbb{R}^d$
  • Definition 1.11
  • ...and 35 more