Discounted Adaptive Online Learning: Towards Better Regularization

Zhiyu Zhang; David Bombara; Heng Yang

TL;DR

The paper tackles nonstationary adversarial online learning by introducing a discounted regret framework and an adaptive FTRL-based algorithm that achieves instance-optimal performance beyond constant-learning-rate baselines. It employs a rescaling trick to convert scale-free undiscounted guarantees into discounted ones and develops a two-component, simultaneous adaptivity scheme that learns both the direction and magnitude of the comparator via polar decomposition. The framework is extended to online conformal prediction (OCP), where stability-based guarantees yield improved coverage and reduced dependence on unknown horizon or maximal radius. Empirical results in OCP demonstrate favorable coverage, narrower prediction sets, and competitive runtimes compared to strong baselines. Collectively, the work strengthens the link between adaptive regularization and discounted online optimization, with practical implications for lifelong learning and robust uncertainty quantification in nonstationary environments.

Abstract

We study online learning in adversarial nonstationary environments. Since the future can be very different from the past, a critical challenge is to gracefully forget the history while new data comes in. To formalize this intuition, we revisit the discounted regret in online convex optimization, and propose an adaptive (i.e., instance optimal), FTRL-based algorithm that improves the widespread non-adaptive baseline -- gradient descent with a constant learning rate. From a practical perspective, this refines the classical idea of regularization in lifelong learning: we show that designing good regularizers can be guided by the principled theory of adaptive online optimization. Complementing this result, we also consider the (Gibbs and Candès, 2021)-style online conformal prediction problem, where the goal is to sequentially predict the uncertainty sets of a black-box machine learning model. We show that the FTRL nature of our algorithm can simplify the conventional gradient-descent-based analysis, leading to instance-dependent performance guarantees.
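To make the online conformal prediction setting concrete, here is a minimal sketch of the conventional gradient-descent-style baseline: online gradient descent on the pinball (quantile) loss over a scalar radius. It is not the FTRL-based algorithm proposed in the paper. The function name `quantile_tracking`, the scalar-score setup, and the parameters `alpha` (target miscoverage rate), `eta` (learning rate), and `r_init` are illustrative assumptions.

```python
import random


def quantile_tracking(scores, alpha, eta, r_init=0.0):
    """Gradient descent on the pinball loss for a scalar radius.

    Illustrative baseline only (assumed setup): at each step we commit to a
    radius r, then observe the score s of the true label. The prediction set
    covers the label iff s <= r.
    """
    r = r_init
    radii, errs = [], []
    for s in scores:
        radii.append(r)
        err = 1.0 if s > r else 0.0  # 1 means the set missed the label
        errs.append(err)
        # Pinball-loss subgradient step: grow r after a miss, shrink otherwise.
        r = r + eta * (err - alpha)
    return radii, errs


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.random() for _ in range(5000)]  # synthetic scores in [0, 1]
    radii, errs = quantile_tracking(scores, alpha=0.1, eta=0.05)
    print("empirical coverage:", 1.0 - sum(errs) / len(errs))
```

The fixed `eta` in this sketch is the kind of hand-tuned quantity that an adaptive method aims to remove; the sketch also assumes bounded scores, which is what keeps the radius from drifting.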

Paper Structure (52 sections, 17 theorems, 73 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Contribution
Related work
Discounting
Adaptivity
Practical lifelong learning
Notation
Discounted adaptivity
Setting
Discounting as forgetting
Inductive bias
Preliminary
Online Gradient Descent
Rescaling trick
Gradient adaptive OGD
...and 37 more sections

Key Result

Theorem 1

If the loss functions are all $G$-Lipschitz, the diameter of the domain is at most $D$, and the discount factor $\lambda_t=\lambda\in(0,1)$, then OGD with a constant learning rate $\eta_t=\frac{D}{G}\sqrt{1-\lambda^2}$ guarantees for all $T=\Omega(\frac{1}{1-\lambda})$, Conversely, fix any variance budget $V\in(0,G^2H_T]$, and any comparator $u$ such that $u,-u\in\mathcal{X}$. For any algorithm,
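The constant-learning-rate baseline in Theorem 1 can be sketched as follows, under several assumptions that are not in the statement above: a one-dimensional domain $[-D/2, D/2]$, absolute losses $|x - z_t|$ (so $G=1$), and a discounted regret that weights round $t$ by $\lambda^{T-t}$. All function names are hypothetical, and the sketch does not check the theorem's bound, which is not reproduced here.

```python
import math


def constant_lr(D, G, lam):
    """Learning rate eta = (D / G) * sqrt(1 - lam^2) from the theorem statement."""
    return (D / G) * math.sqrt(1.0 - lam ** 2)


def project(x, radius):
    """Euclidean projection onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def ogd_constant_lr(targets, D, G, lam):
    """Projected OGD on the (assumed) losses l_t(x) = |x - z_t|."""
    eta = constant_lr(D, G, lam)
    x = 0.0
    iterates = []
    for z in targets:
        iterates.append(x)  # play x_t before seeing z_t
        g = 0.0 if x == z else (1.0 if x > z else -1.0)  # subgradient of |x - z|
        x = project(x - eta * g, D / 2.0)
    return iterates


def discounted_regret(iterates, targets, u, lam):
    """Assumed definition: sum over t of lam^(T-t) * (l_t(x_t) - l_t(u))."""
    T = len(targets)
    return sum(
        lam ** (T - 1 - t) * (abs(x - z) - abs(u - z))
        for t, (x, z) in enumerate(zip(iterates, targets))
    )


if __name__ == "__main__":
    # The target jumps halfway through; discounting down-weights the old regime.
    targets = [-0.5] * 200 + [0.5] * 200
    xs = ogd_constant_lr(targets, D=2.0, G=1.0, lam=0.99)
    print("final iterate:", xs[-1])
    print("discounted regret vs u=0.5:", discounted_regret(xs, targets, 0.5, 0.99))
```

Because the weights decay geometrically, rounds before the jump contribute almost nothing to the final discounted regret, which is the sense in which discounting acts as forgetting.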

Figures (3)

Figure 1: The local coverage (first row), local width (second row), and corresponding corruption level (third row) of our algorithms. Results are obtained using corrupted versions of TinyImageNet, with time-varying corruption level (distribution shift). (Left) Results for sudden changes in corruption level. (Right) Results for gradual changes in corruption level. The distribution shifts every 500 steps. Moving averages are plotted with a window size of 100 time steps ($k=100$).
Figure 2: The average coverage (first row) and average width (second row) as a function of the estimated maximum radius $D_{\mathrm{est}}$, relative to the true radius $D$. The performance of Sf-Ogd and Saocp (Bhatnagar et al., 2023) is sensitive to $D_{\mathrm{est}}/D$. Averages are taken over the entire time horizon, with $T = 6011$ total time steps.
Figure 3: The runtime per time step of each algorithm, normalized to the runtime of Simple OGD. Saocp has the longest runtime because it is a meta-algorithm that initializes Sf-Ogd on each time step.

Theorems & Definitions (30)

Theorem 1: Abridged Theorem \ref{theorem:ogd} and \ref{theorem:lower}
Theorem 2
Theorem 3
Example 1
Theorem 4: Main result
Lemma 3.1: Abridged Lemma \ref{lemma:connecting}
Theorem 5
Lemma B.1
Proof: Proof of Lemma \ref{lemma:e}
Theorem 6
...and 20 more
