Bilevel Learning with Inexact Stochastic Gradients

Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt

TL;DR

This work addresses bilevel optimization with an inexact lower-level solver and stochastic upper-level objective by developing ISGD, which uses inexact stochastic hypergradients $z_v(\theta)=\nabla f_v(\theta)+e_v(\theta)$. It connects these gradients to practical biased SGD (ABC) assumptions and proves convergence under mild conditions, even when lower-level evaluations are inexact. The authors demonstrate speedups and improved generalization in imaging tasks by learning data-adaptive FoE regularizers for denoising and deblurring, showing strong performance on large datasets where deterministic bilevel methods struggle. Overall, the approach provides a scalable, theoretically grounded framework for bilevel learning with inexact gradients, with clear benefits for variational regularization in imaging.
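The idea of stepping along an inexact stochastic hypergradient $z_v(\theta)=\nabla f_v(\theta)+e_v(\theta)$ can be illustrated on a toy scalar problem. The following is a minimal sketch, not the authors' implementation: the quadratic lower-level problem, the function names, the fixed tolerance `tol`, and the clamp that keeps $\theta \geq 0$ are all assumptions made for illustration. The error $e_v(\theta)$ arises because the lower-level problem is solved only to a finite accuracy before the implicit-function formula is applied.

```python
import random


def solve_lower_level(theta, y, tol, x0=0.0, max_iter=1000):
    """Approximately minimise h(x) = 0.5*(x - y)**2 + 0.5*theta*x**2 by gradient descent."""
    x = x0
    step = 0.5 / (1.0 + theta)  # half of 1/L, so the gradient residual halves each iteration
    for _ in range(max_iter):
        grad = (x - y) + theta * x
        if abs(grad) <= tol:  # stop early: the solution is deliberately inexact
            break
        x -= step * grad
    return x


def inexact_hypergradient(theta, y, x_true, tol):
    """Hypergradient of 0.5*(x_hat(theta) - x_true)**2, evaluated at an inexact x_hat."""
    x_tilde = solve_lower_level(theta, y, tol)
    # Implicit function theorem for this h: d x_hat / d theta = -x_hat / (1 + theta).
    dx_dtheta = -x_tilde / (1.0 + theta)
    return (x_tilde - x_true) * dx_dtheta


def full_loss(theta, data):
    """Exact upper-level loss, using the closed form x_hat = y / (1 + theta)."""
    return sum(0.5 * (y / (1.0 + theta) - x_true) ** 2 for y, x_true in data) / len(data)


def isgd(data, theta0=2.0, alpha=0.2, tol=1e-2, num_iters=2000, seed=0):
    """Constant-step SGD on theta driven by inexact, single-sample hypergradients."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(num_iters):
        y, x_true = rng.choice(data)  # sample one upper-level term
        z = inexact_hypergradient(theta, y, x_true, tol)
        theta = max(theta - alpha * z, 0.0)  # clamp is an illustrative safeguard only
    return theta


def make_data(n=200, noise_std=0.5, seed=1):
    """Synthetic pairs (noisy measurement y, ground truth x_true)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x_true = rng.gauss(0.0, 1.0)
        data.append((x_true + rng.gauss(0.0, noise_std), x_true))
    return data


if __name__ == "__main__":
    data = make_data()
    theta = isgd(data)
    print("initial loss:", full_loss(2.0, data))
    print("final loss:  ", full_loss(theta, data))
```

In this toy setting the lower-level solution has the closed form $y/(1+\theta)$, which is used only to evaluate the loss; the update itself never sees the exact solution.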

Abstract

Bilevel learning has gained prominence in machine learning, inverse problems, and imaging applications, including hyperparameter optimization, learning data-adaptive regularizers, and optimizing forward operators. The large-scale nature of these problems has led to the development of inexact and computationally efficient methods. Existing adaptive methods predominantly rely on deterministic formulations, while stochastic approaches often adopt a doubly-stochastic framework with impractical variance assumptions, enforce a fixed number of lower-level iterations, and require extensive tuning. In this work, we focus on bilevel learning with strongly convex lower-level problems and a nonconvex sum-of-functions in the upper level. Stochasticity arises from data sampling in the upper level, which leads to inexact stochastic hypergradients. We establish their connection to state-of-the-art stochastic optimization theory for nonconvex objectives. Furthermore, we prove the convergence of inexact stochastic bilevel optimization under mild assumptions. Our empirical results highlight significant speed-ups and improved generalization in imaging tasks such as image denoising and deblurring in comparison with adaptive deterministic bilevel methods.
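For orientation, the problem class described above can be written out as follows. This is a sketch in assumed notation: the symbols $m$, $g_i$, $h_i$, and $\hat{x}_i$ are introduced here for illustration and may differ from the paper's own; only $z_v(\theta)=\nabla f_v(\theta)+e_v(\theta)$ is taken from the summary. Each $h_i(\cdot,\theta)$ is assumed strongly convex, while $f$ may be nonconvex.

```latex
\begin{aligned}
\min_{\theta}\; f(\theta) &= \frac{1}{m}\sum_{i=1}^{m} f_i(\theta),
  \qquad f_i(\theta) = g_i\bigl(\hat{x}_i(\theta)\bigr), \\
\hat{x}_i(\theta) &= \arg\min_{x}\; h_i(x,\theta), \\
z_v(\theta) &= \nabla f_v(\theta) + e_v(\theta),
  \qquad v \text{ sampled from } \{1,\dots,m\}.
\end{aligned}
```

Here $e_v(\theta)$ collects the error incurred by solving the lower-level problem only approximately.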

Paper Structure

This paper contains 8 sections, 6 theorems, 26 equations, 9 figures, 1 algorithm.

Key Result

Theorem

Convergence: Let Assumption 1 hold and $\mathbb{E}[\epsilon_k^2]$ be bounded above for all $k\geq 0$. Let $\delta_0 \overset{\text{def}}{=} f(\theta^0) - f^*$. There exist constants $c_1,c_2,\dots,c_5>0$ such that if the step size satisfies $0 < \alpha \leq \frac{c_1}{L_{\nabla f}}$, then the iterate

Figures (9)

  • Figure 1: Loss per mini-batch versus total lower-level computational cost for ISGD with a constant step size, $\alpha_k = \alpha_0$, (left), and a decreasing step size (DS), $\alpha_k = \frac{\alpha_0}{\sqrt{k}}$, (right). While the decreasing step size variant exhibits more stability, a fixed small step size achieves comparable performance.
  • Figure 2: Comparison of ISGD (with and without decreasing step sizes) and MAID in terms of loss per mini-batch for ISGD variants and upper-level loss for MAID, plotted against computational cost, as well as average training PSNR per epoch vs total computations.
  • Figure 4: Comparison on test dataset between the average PSNR per MAID iteration (full-batch) and average PSNR per epoch of ISGD with decreasing step size $\alpha_k = \frac{\alpha_0}{\sqrt{k}}$.
  • Figure 5: Comparison of deblurred images using the learned FoE regularizer, learned by MAID and ISGD, on test images.
  • ...and 4 more figures
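
The two step-size rules named in the Figure 1 caption, $\alpha_k = \alpha_0$ and $\alpha_k = \frac{\alpha_0}{\sqrt{k}}$, can be sketched as plain functions. The function names are hypothetical, and starting the iteration counter at $k = 1$ is an assumption made here so that the square root is well defined; the captions do not state the indexing.

```python
import math


def constant_step(alpha0, k):
    """Constant schedule: alpha_k = alpha0 for every iteration k."""
    return alpha0


def decreasing_step(alpha0, k):
    """Decreasing schedule: alpha_k = alpha0 / sqrt(k), with k assumed to start at 1."""
    if k < 1:
        raise ValueError("k must be >= 1 for the alpha0 / sqrt(k) schedule")
    return alpha0 / math.sqrt(k)


if __name__ == "__main__":
    for k in (1, 4, 100):
        print(k, constant_step(0.1, k), decreasing_step(0.1, k))
```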

Theorems & Definitions (13)

  • Theorem
  • Corollary
  • Lemma
  • Lemma
  • Proof
  • Lemma
  • Proof
  • Proposition
  • Proof
  • Remark
  • ...and 3 more