Gradient Equilibrium in Online Learning: Theory and Applications

Anastasios N. Angelopoulos, Michael I. Jordan, Ryan J. Tibshirani

TL;DR

Gradient equilibrium (GEQ) offers a principled, non-stochastic lens for online learning: if the average gradient along the learner’s path tends to zero, the sequence satisfies a sequential first-order condition with interpretable implications (e.g., unbiasedness, coverage, and debiasing) across regression, classification, and calibration tasks. The authors show that gradient descent with constant step sizes achieves GEQ under mild conditions (bounded or slowly growing iterates), extend the theory to regularization and arbitrary step sizes, and connect GEQ to the classical concepts of monotonicity and co-coercivity. They demonstrate practical consequences including debiasing black-box predictions under distribution shift, calibrating quantiles, and deriving unbiased Elo scores in pairwise preference problems, with experiments on the COMPAS, HelpSteer2, Chatbot Arena, and MIMIC datasets. The work positions GEQ as a tool complementary to regret, enabling online debiasing and calibration without stochastic assumptions, and suggests directions in online conformal prediction, multiaccuracy, and control theory.

Abstract

We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
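To make the constant-step-size claim concrete, the following is a minimal sketch, not the authors' code. It assumes the squared loss $\ell_t(\theta) = \frac{1}{2}(y_t - f_t - \theta)^2$, where $f_t$ is a black-box prediction and $\theta_t$ is an additive post hoc correction. The function name `online_debias`, the drifting synthetic data, and the step size `eta = 0.05` are all illustrative assumptions. Because each update is $\theta_{t+1} = \theta_t - \eta g_t$, the gradients telescope: their average equals $(\theta_1 - \theta_{T+1}) / (\eta T)$, which is small whenever the iterates stay bounded or grow slowly.

```python
import numpy as np

def online_debias(y, f, eta=0.05, theta0=0.0):
    """Online gradient descent with a constant step size on
    l_t(theta) = 0.5 * (y_t - f_t - theta)^2 (hypothetical helper)."""
    theta = theta0
    thetas, grads = [], []
    for y_t, f_t in zip(y, f):
        thetas.append(theta)
        # Gradient at theta_t: corrected prediction minus outcome.
        g = (f_t + theta) - y_t
        grads.append(g)
        theta = theta - eta * g  # constant step size, no decay
    return np.array(thetas), np.array(grads), theta

# Synthetic stream (assumption): outcomes drift upward over time,
# while the black-box prediction stays at zero and so becomes biased.
rng = np.random.default_rng(0)
T, eta = 5000, 0.05
y = np.linspace(0.0, 2.0, T) + rng.normal(size=T)
f = np.zeros(T)

thetas, grads, theta_final = online_debias(y, f, eta=eta)

raw_bias = np.mean(f - y)        # average bias of the uncorrected predictions
avg_grad = grads.mean()          # average bias of the corrected predictions
telescoped = (thetas[0] - theta_final) / (eta * T)

print(f"raw bias:         {raw_bias:.4f}")
print(f"average gradient: {avg_grad:.4f}")
print(f"telescoped value: {telescoped:.4f}")
```

In this sketch the average gradient is exactly the average bias of the corrected predictions $f_t + \theta_t$, so driving it toward zero is the debiasing property the abstract describes. No distributional assumption on the stream is used in the telescoping identity itself.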

Paper Structure (74 sections, 25 theorems, 196 equations, 10 figures, 1 table, 5 algorithms)

Key Result

Proposition 1

For any sequence $y_t$, $t = 1,\dots,T$, denote its sample mean and variance by $\bar{y}_T = \frac{1}{T} \sum_{t=1}^T y_t$ and $s_T^2 = \frac{1}{T} \sum_{t=1}^T (y_t - \bar{y}_T)^2$. There exists a sequence $\theta_t$, $t = 1,\dots,T$ such that

Figures (10)

  • Figure 1: Multigroup debiasing results on the MIMIC dataset, on predicting length-of-stay of patients in a hospital system in Boston, Massachusetts. We train an XGBoost model on a large number of features from this dataset and run our multigroup debiasing procedure with respect to ethnicity (top row) and marital status (bottom row), with each column showing a different learning rate. Gradient equilibrium (third row of Table \ref{tab:grad_eq}, where the features are group indicators) for this problem says that we achieve zero bias for each ethnicity and marital status, in the long run. https://github.com/aangelopoulos/gradient-equilibrium/blob/main/mimic_stay/multigroup.ipynb
  • Figure 3: Regret and bias for gradient descent on squared losses, with constant step sizes.
  • Figure 4: Two examples with $\ell_t(\theta) = |\theta|$ which show that neither no regret (NR) nor gradient equilibrium (GEQ) necessarily implies the other. In each panel, the iterates start at the upper right-most point, and the thin gray lines serve simply as a visual aid to show the order of the sequence.
  • Figure 5: Illustration of a restorative field, where each gold arrow represents the negative gradient of $\ell_t$ at a particular point. Note that if we take this point to be $\theta_t$, then the gradient descent update would move $\theta_{t+1}$ in the direction of the arrow. The gold arrows need to point inwards outside of a radius $h$; within the radius, the field can be arbitrary, so it is not drawn.
  • Figure 7: Statistics of the COMPAS dataset. On the left-hand side is a calibration plot showing the predicted recidivism rate on the horizontal axis and the true recidivism rate, conditional on the predicted one, on the vertical axis. The COMPAS algorithm tends to overpredict recidivism for African-Americans as the predicted recidivism rate grows, as compared to Caucasians and Hispanics. On the top right, we show a rolling average of the race distribution over time, with window size 100 (in terms of individuals screened). On the bottom right, we show a rolling average of the prediction bias in time, again with window size 100. A positive bias indicates an overprediction of recidivism. Thus, in general, the algorithm is underpredicting recidivism. The drastic drop in bias towards the end is an artifact of the dataset: the distribution changes drastically so that almost all individuals screened are recidivist.
  • ...and 5 more figures
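
The separation between no regret and gradient equilibrium noted in the Figure 4 caption can be checked numerically. The sketch below uses two illustrative sequences chosen for this purpose; they are assumptions and not necessarily the sequences drawn in the figure. With $\ell_t(\theta) = |\theta|$, the best fixed comparator is $\theta = 0$ with zero cumulative loss, so regret is simply $\sum_t |\theta_t|$, and a subgradient at $\theta_t \neq 0$ is $\mathrm{sign}(\theta_t)$. The helper name `avg_subgradient_and_regret` is hypothetical.

```python
import numpy as np

def avg_subgradient_and_regret(thetas):
    """For l_t(theta) = |theta|, return (average subgradient, average regret).

    The best fixed comparator is theta = 0 with zero cumulative loss,
    so the regret of a sequence is the sum of |theta_t|.
    """
    thetas = np.asarray(thetas, dtype=float)
    grads = np.sign(thetas)          # subgradient of |theta| away from zero
    regret = np.abs(thetas).sum()    # cumulative loss minus comparator loss (0)
    return grads.mean(), regret / len(thetas)

T = 10_000
t = np.arange(1, T + 1)

# Alternating +1, -1: subgradients cancel, but the loss is 1 every round.
alternating = np.where(t % 2 == 1, 1.0, -1.0)

# theta_t = 1/t: cumulative loss grows like log T, but every subgradient is +1.
shrinking = 1.0 / t

geq_not_nr = avg_subgradient_and_regret(alternating)
nr_not_geq = avg_subgradient_and_regret(shrinking)

print("alternating (avg gradient, avg regret):", geq_not_nr)
print("shrinking   (avg gradient, avg regret):", nr_not_geq)
```

The alternating sequence has average subgradient zero but average regret one (GEQ without NR), while the shrinking sequence has vanishing average regret but average subgradient one (NR without GEQ).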

Theorems & Definitions (43)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Proposition 7
  • ...and 33 more