Online Nonconvex Bilevel Optimization with Bregman Divergences

Jason Bohne; David Rosenberg; Gary Kazantsev; Pawel Polak

TL;DR

This study addresses the online nonconvex-strongly convex bilevel optimization problem. It introduces a novel online Bregman bilevel optimizer (OBBO) that utilizes adaptive Bregman divergences, and the first stochastic online bilevel optimizer (SOBBO), which updates outer-level variables by window averaging, that is, with a weighted average of recent stochastic approximations of hypergradients.

Abstract

Bilevel optimization methods are increasingly relevant within machine learning, especially for tasks such as hyperparameter optimization and meta-learning. Compared to the offline setting, online bilevel optimization (OBO) offers a more dynamic framework by accommodating time-varying functions and sequentially arriving data. This study addresses the online nonconvex-strongly convex bilevel optimization problem. In deterministic settings, we introduce a novel online Bregman bilevel optimizer (OBBO) that utilizes adaptive Bregman divergences. We demonstrate that OBBO enhances the known sublinear rates for bilevel local regret through a novel hypergradient error decomposition that adapts to the underlying geometry of the problem. In stochastic contexts, we introduce the first stochastic online bilevel optimizer (SOBBO), which employs a window averaging method for updating outer-level variables using a weighted average of recent stochastic approximations of hypergradients. This approach not only achieves sublinear rates of bilevel local regret but also serves as an effective variance reduction strategy, obviating the need for additional stochastic gradient samples at each timestep. Experiments on online hyperparameter optimization and online meta-learning highlight the superior performance, efficiency, and adaptability of our Bregman-based algorithms compared to established online and offline bilevel benchmarks.
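
The two ingredients named in the abstract, window averaging of recent gradient estimates and a Bregman proximal step, can be illustrated with a minimal sketch. This is not the paper's OBBO or SOBBO algorithm: the function names, the uniform default weights, the diagonal quadratic reference function, and the toy single-level quadratic loss are all assumptions made for illustration, and no hypergradient or inner-level problem is modeled.

```python
from collections import deque

import numpy as np


def window_average(grads, weights=None):
    """Weighted average of the gradient estimates currently in the window.

    `weights` is a hypothetical parameter; uniform weights are used if omitted.
    """
    g = np.stack(list(grads))
    if weights is None:
        weights = np.ones(len(g))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * g).sum(axis=0) / weights.sum()


def bregman_step(x, g, step, h_diag):
    """Unconstrained Bregman proximal step for phi(x) = 0.5 * x^T diag(h_diag) x.

    For this phi, D_phi(u, x) = 0.5 * (u - x)^T diag(h_diag) (u - x), so
    argmin_u <g, u> + D_phi(u, x) / step has the closed form below.
    """
    return x - step * g / h_diag


def run_demo(T=200, w=10, step=0.1, noise=0.5, seed=0):
    """Toy run: noisy gradients of 0.5 * ||x - target||^2, averaged over a window."""
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0])
    h_diag = np.array([1.0, 4.0])  # assumed fixed geometry, not adaptive
    x = np.zeros(2)
    window = deque(maxlen=w)  # keeps only the w most recent estimates
    for _ in range(T):
        noisy_grad = (x - target) + noise * rng.standard_normal(2)
        window.append(noisy_grad)
        x = bregman_step(x, window_average(window), step, h_diag)
    return x, target
```

With `h_diag` set to all ones the step reduces to plain gradient descent, which is one way to see the Bregman update as a generalization of the Euclidean one. Because the window reuses past estimates, no extra gradient samples are drawn per round in this sketch.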

Paper Structure (26 sections, 29 theorems, 152 equations, 7 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Notations and Assumptions
Bregman Proximal Gradient
Bilevel Local Regret
Online Bregman Bilevel Optimizers
Deterministic Algorithm (OBBO)
Stochastic Algorithm (SOBBO)
Bilevel Local Regret Minimization
OBBO Regret
SOBBO Regret
Experimental Results
Conclusions
Notation and Preliminaries
...and 11 more sections

Key Result

Lemma 4.1

(Lemma 2.1 in Ghadimi and Wang, 2018) Under Assumptions A and C, we have $\forall \boldsymbol{\lambda}\in\mathcal{X}$, $\forall t\in[1,T]$

Figures (7)

Figure 1: Left Panel: Median cumulative local regret of OBBO vs. benchmark algorithms, with median deviation bars plotted every 100 rounds. Right Panel: Gradient norm of OBBO ($w=25$) vs. benchmark algorithms at $t=T$, with the $y=x$ line plotted to visualize the improvement OBBO offers in achieving a solution with smaller gradient norm.
Figure 2: Online meta-learning task on the FC100 dataset. Left Panel: Improvement with OBBO on cumulative bilevel local regret. Middle Panel: Higher training accuracy with OBBO. Right Panel: Test accuracy: OBBO outperforms SOBOW while achieving OAGD performance with a 10x ($w=10$) computationally cheaper update.
Figure 3: Sample training-validation subsets for AMD U.S. Equity with annotated market event on 11-08-2021.
Figure 4: Left Panel: Median cumulative local regret of OBBO vs. online and offline benchmark algorithms across 440 U.S. markets with window size parameter $w \in \{1, 25\}$, with median deviation bars plotted every 100 rounds. Middle Panel: Gradient norm of OBBO ($w=25$) vs. online and offline benchmark algorithms at $t=T$, with the $y=x$ line plotted to visualize the improvement OBBO offers in achieving a smaller gradient norm. Right Panel: Forecasting mean-squared error of OBBO ($w=25$) vs. online and offline benchmark algorithms, with the $y=x$ line plotted to visualize the improvement OBBO offers in forecasting loss on a test set.
Figure 5: Example forecasts generated with OBBO vs. online ($w=25$) and offline benchmark algorithms. Note how OBBO achieves a better fit (i.e., smaller loss) relative to benchmarks on the post-annotation test set.
...and 2 more figures

Theorems & Definitions (39)

Lemma 4.1
Lemma 4.2
Lemma 5.1
Theorem 5.2
Lemma 5.3
Theorem 5.4
Lemma A.1
Lemma A.2
Lemma A.3
Lemma A.4
...and 29 more
