Bilevel Learning via Inexact Stochastic Gradient Descent

Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt

TL;DR

The paper proves convergence of inexact stochastic gradient descent for bilevel learning and establishes rates under decaying accuracy and step size schedules; with optimal configurations, convergence occurs at an $\mathcal{O}(k^{-1/4})$ rate in expectation.

Abstract

Bilevel optimization is a central tool in machine learning for high-dimensional hyperparameter tuning. Its applications are vast; for instance, in imaging it can be used for learning data-adaptive regularizers and optimizing forward operators in variational regularization. These problems are large in many ways: a lot of data is usually available to train a large number of parameters, calling for stochastic gradient-based algorithms. However, exact gradients with respect to parameters (so-called hypergradients) are not available, and their precision is usually linearly related to computational cost. Hence, algorithms must solve the problem efficiently without unnecessary precision. The design of such methods is still not fully understood, especially regarding how accuracy requirements and step size schedules affect theoretical guarantees and practical performance. Existing approaches introduce stochasticity at both the upper level (e.g., in sampling or mini-batch estimates) and the lower level (e.g., in solving the inner problem) to improve generalization, but they typically fix the number of lower-level iterations, which conflicts with asymptotic convergence assumptions. In this work, we advance the theory of inexact stochastic bilevel optimization. We prove convergence and establish rates under decaying accuracy and step size schedules, showing that with optimal configurations convergence occurs at an $\mathcal{O}(k^{-1/4})$ rate in expectation. Experiments on image denoising and inpainting with convex ridge regularizers and input-convex networks confirm our analysis: decreasing step sizes improve stability, accuracy scheduling is more critical than step size strategy, and adaptive preconditioning (e.g., Adam) further boosts performance. These results bridge theory and practice, providing convergence guarantees and practical guidance for large-scale imaging problems.
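To make the interplay between the accuracy schedule and the step size schedule concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm or experiments. It learns a single Tikhonov weight $\lambda = e^{\theta}$ for denoising. Everything here is an illustrative assumption: the quadratic lower-level objective, the polynomial forms $\alpha_k = \alpha_0 (k+1)^{-q}$ and $\epsilon_k = \epsilon_0 (k+1)^{-p}$, the stopping rule based on the lower-level gradient norm, and all function names and default values.

```python
import numpy as np

def schedules(k, alpha0=0.5, q=0.6, eps0=1e-1, p=1.0):
    """Assumed polynomial schedules: step size alpha_k and accuracy eps_k."""
    return alpha0 * (k + 1) ** (-q), eps0 * (k + 1) ** (-p)

def lower_level_solve(y, lam, eps):
    """Gradient descent on h(x) = 0.5||x - y||^2 + 0.5*lam*||x||^2,
    stopped as soon as ||grad h(x)|| <= eps (an inexact solve)."""
    x = y.copy()
    step = 0.5 / (1.0 + lam)          # halves the gradient at every iteration
    grad = (1.0 + lam) * x - y
    iters = 0
    while np.linalg.norm(grad) > eps:
        x = x - step * grad
        grad = (1.0 + lam) * x - y
        iters += 1
    return x, iters

def inexact_hypergradient(x, x_true, lam):
    """Implicit-function-theorem hypergradient evaluated at the inexact x.
    For this toy h: dx/dtheta = -lam * x / (1 + lam), with lam = exp(theta)."""
    dx_dtheta = -lam * x / (1.0 + lam)
    return np.mean(np.sum(dx_dtheta * (x - x_true), axis=1))

def full_loss(theta, x_true, y):
    """Upper-level loss over all data, using the closed-form lower-level solution."""
    x = y / (1.0 + np.exp(theta))
    return 0.5 * np.mean(np.sum((x - x_true) ** 2, axis=1))

def run_isgd(num_iters=300, batch=8, d=20, n=200, sigma=0.5, theta0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x_true = rng.standard_normal((n, d))
    y = x_true + sigma * rng.standard_normal((n, d))   # noisy observations
    theta, cost = theta0, 0
    for k in range(num_iters):
        alpha_k, eps_k = schedules(k)
        idx = rng.choice(n, size=batch, replace=False)  # upper-level mini-batch
        lam = np.exp(theta)
        x, iters = lower_level_solve(y[idx], lam, eps_k)
        theta -= alpha_k * inexact_hypergradient(x, x_true[idx], lam)
        cost += iters                                   # lower-level work done so far
    return theta, cost, (x_true, y)

if __name__ == "__main__":
    theta, cost, (x_true, y) = run_isgd()
    print("learned lambda:", np.exp(theta), "lower-level iterations:", cost)
    print("loss before/after:", full_loss(1.0, x_true, y), full_loss(theta, x_true, y))
```

In this toy setting the population-optimal weight is $\lambda = \sigma^2$, so the learned value can be checked against the noise level. Tightening $\epsilon_k$ faster (larger $p$) raises the number of lower-level iterations per step, which is the accuracy-versus-cost trade-off the abstract describes.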

Paper Structure

This paper contains 18 sections, 11 theorems, 70 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Theorem 2.1

Under Assumptions assumption1--assump:sequences and a sufficiently small initial step size $\alpha_0>0$, the iterates $\{\theta^k\}$ generated by Algorithm 1 satisfy for some constants $C_1, C_2>0$. If $L_K \to 0$, then $\mathbb{E}[\|\nabla f(\theta^k)\|]\to 0$ as $K \to \infty$. In particular, for step sizes $\alpha_k = \mathcal{O}(k^{-q})$ with $\tfrac{1}{2}<q<1$ and accurac

Figures (7)

  • Figure 1: Training and test results for initial accuracies $\epsilon_0 = 1$, $\epsilon_0 = 10^{-2}$, and $\epsilon_0 = 10^{-4}$. The first column (a), (d), (g) shows training loss with a fixed upper-level step size. The second column (b), (e), (h) shows training loss with a decreasing upper-level step size with decay exponent $q>0$. The last column (c), (f), (i) compares test PSNR between the best-performing fixed and decreasing step size configurations.
  • Figure 2: Best-performing fixed and decreasing step size configurations across all accuracy settings $\epsilon_0 \in \{10^{0}, 10^{-2}, 10^{-4}\}$. (a) Training loss plotted against computational cost. (b) Test PSNR plotted against computational cost.
  • Figure 3: Denoising results on test data at training checkpoints corresponding to computational costs of 5,000; 10,000; 50,000; and 100,000. Each row displays outputs from different hyperparameter configurations.
  • Figure 4: Running average training loss and average test PSNR for inpainting, plotted against total computational cost. Results are shown for upper-level step size $\alpha_0 = 10^{-2}$, initial accuracy $\epsilon_0 = 10^{-1}$, accuracy schedule exponent $p \in \{0.5, 1\}$, and step size decay exponent $q \in \{0, 0.25, 0.5\}$. All settings with $p = 1$ appear to perform faster than their $p = 0.5$ counterparts.
  • Figure 5: Inpainting comparison across step size decay exponents $q \in \{0, 0.25, 0.5\}$ for fixed upper-level step size, initial accuracy, and accuracy schedule. Top: ground truth and noisy input image with the zoomed crop region. Each row shows the zoomed crop of the output at training checkpoints corresponding to computational costs of 2,500; 5,000; 10,000; and 20,000, with the specified hyperparameters.
  • ...and 2 more figures

Theorems & Definitions (17)

  • Theorem 2.1: Convergence Preview
  • Lemma 3.4
  • Lemma 3.5
  • Lemma 3.6
  • Proposition 3.7
  • Lemma 3.8
  • Proof 1
  • Theorem 3.9
  • Proof 2
  • Theorem 3.10
  • ...and 7 more