Minimax Excess Risk of First-Order Methods for Statistical Learning with Data-Dependent Oracles

Kevin Scaman, Mathieu Even, Batiste Le Bars, Laurent Massoulié

TL;DR

This paper develops a unified minimax framework for first-order optimization when gradient information comes from data-dependent oracles, capturing scenarios where training and test distributions differ. Central to the theory is the best approximation error, which connects the difficulty of estimating population-gradient expectations to the minimax excess risk ${\varepsilon}_{\text{sc}}({\mathcal G},{\mathsf O})$ and is analyzed via Le Cam-type arguments. The authors establish sharp lower and upper bounds, show exact equalities for deterministic oracles, and provide refined results for i.i.d. oracles with mini-batch and warmup strategies. They then instantiate the bounds for supervised, transfer, federated, and robust learning, as well as learning from fixed data points. The results give practical guidance on when simple optimization schemes such as mini-batch gradient descent with warmup are near-optimal and on how distributional shifts influence excess risk, linking gradient estimation to population-risk minimization. Overall, the framework treats these learning settings under a single minimax lens and shows how data-dependent gradient access shapes the trade-off between generalization and optimization.

Abstract

In this paper, our aim is to analyse the generalization capabilities of first-order methods for statistical learning in multiple, different yet related, scenarios including supervised learning, transfer learning, robust learning and federated learning. To do so, we provide sharp upper and lower bounds for the minimax excess risk of strongly convex and smooth statistical learning when the gradient is accessed through partial observations given by a data-dependent oracle. This novel class of oracles can query the gradient with any given data distribution, and is thus well suited to scenarios in which the training data distribution does not match the target (or test) distribution. In particular, our upper and lower bounds are proportional to the smallest mean square error achievable by gradient estimators, thus allowing us to easily derive multiple sharp bounds in the aforementioned scenarios using the extensive literature on parameter estimation.
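
To make the link between gradient-estimation error and excess risk concrete, the following is a minimal sketch and not the paper's algorithm or oracle model. It assumes a toy one-dimensional loss $\ell(\theta; z) = \tfrac{1}{2}(\theta - z)^2$ with $z \sim \mathcal{N}(\mu, \sigma^2)$, for which the population minimizer is $\mu$ and the excess risk of $\theta$ is $\tfrac{1}{2}(\theta - \mu)^2$. The function names, step size, and batch sizes are illustrative choices, and no warmup phase is modelled. Larger mini-batches give a lower mean square error of the gradient estimate, and the final excess risk shrinks accordingly.

```python
import random


def minibatch_gd_excess_risk(batch_size, rng, steps=50, step_size=0.5,
                             mu=2.0, sigma=1.0):
    """Run mini-batch gradient descent on the toy loss 0.5 * (theta - z)^2.

    Returns the excess risk 0.5 * (theta - mu)^2 of the final iterate.
    All parameter values are illustrative assumptions.
    """
    theta = 0.0
    for _ in range(steps):
        # Fresh i.i.d. samples play the role of the oracle's answer.
        batch = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        # Mini-batch estimate of the population gradient (theta - mu).
        grad_estimate = theta - sum(batch) / batch_size
        theta -= step_size * grad_estimate
    return 0.5 * (theta - mu) ** 2


def mean_excess_risk(batch_size, trials=200, seed=0):
    """Average the final excess risk over independent runs."""
    rng = random.Random(seed)
    total = sum(minibatch_gd_excess_risk(batch_size, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for b in (1, 8, 64):
        print(f"batch size {b:3d}: mean excess risk {mean_excess_risk(b):.5f}")
```

In this toy setting the variance of the gradient estimate is $\sigma^2 / b$ for batch size $b$, so the printed excess risk should decrease roughly in proportion to $1/b$.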

Paper Structure

This paper contains 35 sections, 18 theorems, 77 equations, 1 table, and 1 algorithm.

Key Result

Proposition 1

For any distribution ${\mathcal{D}}$, function class ${\mathcal{G}}$, and data-dependent oracle ${\mathsf O}$ satisfying Assumption ass:O, we have

Theorems & Definitions (47)

  • Definition 1: Function class
  • Remark 1
  • Example 1: Least squares regression
  • Example 2: Regularized Lipschitz losses
  • Definition 2: Data-dependent oracle
  • Definition 3: Optimization algorithm
  • Definition 4: Best approximation error
  • Example 3: Conditional standard deviation
  • Example 4: Deviations and barycenters
  • Proposition 1
  • ...and 37 more