Bilevel Learning with Inexact Stochastic Gradients
Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt
TL;DR
This work addresses bilevel optimization with an inexact lower-level solver and a stochastic upper-level objective by developing ISGD, which uses inexact stochastic hypergradients $z_v(\theta)=\nabla f_v(\theta)+e_v(\theta)$. It connects these gradients to practical biased SGD (ABC) assumptions and proves convergence under mild conditions, even when lower-level evaluations are inexact. The authors demonstrate speed-ups and improved generalization in imaging tasks by learning data-adaptive FoE regularizers for denoising and deblurring, showing strong performance on large datasets where deterministic bilevel methods struggle. Overall, the approach provides a scalable, theoretically grounded framework for bilevel learning with inexact gradients, with clear benefits for variational regularization in imaging.
Abstract
Bilevel learning has gained prominence in machine learning, inverse problems, and imaging applications, including hyperparameter optimization, learning data-adaptive regularizers, and optimizing forward operators. The large-scale nature of these problems has led to the development of inexact and computationally efficient methods. Existing adaptive methods predominantly rely on deterministic formulations, while stochastic approaches often adopt a doubly-stochastic framework with impractical variance assumptions, enforce a fixed number of lower-level iterations, and require extensive tuning. In this work, we focus on bilevel learning with strongly convex lower-level problems and a nonconvex sum-of-functions objective in the upper level. Stochasticity arises from data sampling in the upper level, which leads to inexact stochastic hypergradients. We establish their connection to state-of-the-art stochastic optimization theory for nonconvex objectives. Furthermore, we prove the convergence of inexact stochastic bilevel optimization under mild assumptions. Our empirical results highlight significant speed-ups and improved generalization in imaging tasks such as image denoising and deblurring in comparison with adaptive deterministic bilevel methods.
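
To make the notion of an inexact stochastic hypergradient concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical scalar lower-level problem $h_v(x)=\tfrac12(x-y_v)^2+\tfrac12\theta x^2$, an upper-level loss $f_v(\theta)=\tfrac12(\hat{x}_v(\theta)-x_v^\ast)^2$, a fixed lower-level tolerance, and a constant step size. All function names, data, and parameter values are illustrative assumptions. The lower-level problem is solved only approximately, so the sampled hypergradient carries an error, in the spirit of $z_v(\theta)=\nabla f_v(\theta)+e_v(\theta)$.

```python
import random


def solve_lower_level(theta, y, tol):
    """Approximately minimize h(x) = 0.5*(x - y)**2 + 0.5*theta*x**2.

    Plain gradient descent, stopped once |h'(x)| <= tol, so the returned
    point is only an approximation of the exact minimizer y / (1 + theta).
    """
    x = 0.0
    step = 0.5 / (1.0 + theta)  # h is (1 + theta)-smooth and strongly convex
    grad = (1.0 + theta) * x - y
    while abs(grad) > tol:
        x -= step * grad
        grad = (1.0 + theta) * x - y
    return x


def inexact_hypergradient(theta, y, x_true, tol):
    """Approximate d/dtheta of 0.5*(x(theta) - x_true)**2 for one sample."""
    x_hat = solve_lower_level(theta, y, tol)
    # Implicit function theorem for this toy problem:
    # dx/dtheta = -x / (1 + theta), evaluated at the approximate solution.
    dx_dtheta = -x_hat / (1.0 + theta)
    return (x_hat - x_true) * dx_dtheta


def run_sgd_sketch(n_samples=200, n_iters=500, step=0.5, tol=1e-3, seed=0):
    """SGD on theta using one randomly sampled inexact hypergradient per step."""
    rng = random.Random(seed)
    theta_star = 0.5  # assumed ground-truth parameter for the synthetic data
    x_true = [rng.uniform(0.5, 1.5) for _ in range(n_samples)]
    y = [(1.0 + theta_star) * x for x in x_true]  # exact solve at theta_star recovers x_true
    theta = 2.0
    for _ in range(n_iters):
        v = rng.randrange(n_samples)  # sample one upper-level term
        z = inexact_hypergradient(theta, y[v], x_true[v], tol)
        theta = max(theta - step * z, 0.0)  # keep the parameter nonnegative
    return theta


theta_final = run_sgd_sketch()
print(theta_final)
```

In this sketch the lower-level tolerance `tol` controls the size of the hypergradient error: a looser tolerance makes each lower-level solve cheaper but biases the iterates further from the value used to generate the synthetic data.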
