Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

Qianli Shen; Yezhen Wang; Zhouhao Yang; Xiang Li; Haonan Wang; Yang Zhang; Jonathan Scarlett; Zhanxing Zhu; Kenji Kawaguchi

TL;DR

This work tackles the memory and scalability bottlenecks of gradient-based bi-level optimization for large-scale models. It introduces Forward Gradient Unrolling with Forward Gradient, $(\text{FG})^2\text{U}$, an unbiased stochastic meta-gradient estimator whose memory footprint does not scale with the inner unrolled depth $T$ or the meta-parameter dimension $N$, and which parallelizes readily. A theoretical convergence analysis shows an $O(\epsilon^{-1}\rho^{-1})$ rate with $\rho=b/(N-1)$, where $b$ is the sample size. The authors also propose a practical two-phase training paradigm that first uses faster, biased methods and then applies $(\text{FG})^2\text{U}$ for accurate refinement. A zeroth-order variant, $(\text{FG})^2\text{U}$-ZO, extends applicability to non-differentiable inner solvers. Empirically, $(\text{FG})^2\text{U}$ demonstrates superior gradient quality and memory efficiency across data condensation, meta-learning for online LM adaptation, and PDE-driven bi-level problems, highlighting its potential to scale bi-level optimization to very large models and distributed settings.
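To make the idea concrete, here is a minimal sketch of a forward-gradient estimate through an unrolled inner loop. It is an illustration under stated assumptions, not the authors' implementation: the toy inner objective $0.5\|\theta-\phi\|^2$, the outer objective $0.5\|\theta_T-y\|^2$, and all function names are hypothetical, and the tangent is propagated by hand rather than with a framework's forward-mode autodiff. The point it shows is that only the current inner state and one tangent vector are stored, regardless of the number of unrolled steps.

```python
import random


def unrolled_directional_derivative(phi, v, y, steps=20, lr=0.1):
    """Unroll the inner loop while carrying z_t = (d theta_t / d phi) v."""
    n = len(phi)
    theta = [0.0] * n  # inner parameters theta_0
    z = [0.0] * n      # tangent of theta_0 along direction v
    for _ in range(steps):
        # Assumed inner objective: g(theta, phi) = 0.5 * ||theta - phi||^2,
        # so one gradient step is theta <- theta - lr * (theta - phi).
        theta = [t - lr * (t - p) for t, p in zip(theta, phi)]
        # The same update rule differentiated along v (forward mode).
        z = [zi - lr * (zi - vi) for zi, vi in zip(z, v)]
    # Assumed outer objective: f(theta_T) = 0.5 * ||theta_T - y||^2.
    # Its directional derivative along v is (theta_T - y) . z_T.
    return sum((t - yi) * zi for t, yi, zi in zip(theta, y, z))


def fg2u_estimate(phi, y, b, rng, steps=20, lr=0.1):
    """Average b rank-one estimates (directional derivative) * v."""
    n = len(phi)
    grad = [0.0] * n
    for _ in range(b):
        # Random sign (Rademacher) direction in {-1, 1}^N.
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        d = unrolled_directional_derivative(phi, v, y, steps, lr)
        grad = [g + d * vi / b for g, vi in zip(grad, v)]
    return grad


def exact_meta_gradient(phi, y, steps=20, lr=0.1):
    """Closed form for this toy problem: theta_T = c * phi."""
    c = 1.0 - (1.0 - lr) ** steps
    return [c * (c * p - yi) for p, yi in zip(phi, y)]


rng = random.Random(0)
phi = [0.5, -1.0, 2.0, 0.3]
y = [1.0, 0.0, -1.0, 0.5]
estimate = fg2u_estimate(phi, y, b=2000, rng=rng)
exact = exact_meta_gradient(phi, y)
print("estimate:", [round(g, 3) for g in estimate])
print("exact:   ", [round(g, 3) for g in exact])
```

Each of the `b` directions is processed independently, which is why this kind of estimator can be distributed across devices; in a deep learning framework the hand-written tangent update would typically be replaced by a Jacobian-vector product.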

Abstract

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{G}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks. Code is available at https://github.com/ShenQianli/FG2U.
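The zeroth-order adaptation mentioned in the abstract can be sketched with a generic finite-difference estimator, shown below. This is an assumption-laden illustration and may differ from the paper's exact formulation: the central-difference form, the smoothing parameter `mu`, and the stand-in `black_box_objective` (a hypothetical solver followed by an outer loss) are all chosen here for demonstration only.

```python
import random


def black_box_objective(phi):
    """Hypothetical stand-in: run an opaque inner solver, then score it."""
    # Assumed solver output: theta*(phi) = phi / 2, elementwise.
    theta = [0.5 * p for p in phi]
    # Assumed outer loss: 0.5 * ||theta - 1||^2.
    return 0.5 * sum((t - 1.0) ** 2 for t in theta)


def zo_forward_gradient(h, phi, b, rng, mu=1e-3):
    """Estimate grad h(phi) using only evaluations of h."""
    n = len(phi)
    grad = [0.0] * n
    for _ in range(b):
        # Random sign direction in {-1, 1}^N.
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        plus = h([p + mu * vi for p, vi in zip(phi, v)])
        minus = h([p - mu * vi for p, vi in zip(phi, v)])
        # Central difference approximates the directional derivative.
        d = (plus - minus) / (2.0 * mu)
        grad = [g + d * vi / b for g, vi in zip(grad, v)]
    return grad


rng = random.Random(0)
phi = [1.0, 2.0, 3.0]
estimate = zo_forward_gradient(black_box_objective, phi, b=4000, rng=rng)
print("zeroth-order estimate:", [round(g, 3) for g in estimate])
```

Because the estimator only calls `h`, the inner solver can be any routine that returns a scalar score, at the cost of two solver runs per sampled direction and a finite-difference error controlled by `mu`.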

Paper Structure (32 sections, 6 theorems, 62 equations, 5 figures, 7 tables, 1 algorithm)

Introduction
Background
$\text{(FG)}^2$U: Forward Gradient Unrolling with Forward Gradient
Convergence
Practical Considerations
Experiments
Conclusion
Acknowledgements
Algorithm
Extended Discussion on Bi-level Optimization
Truncated Reverse Gradient Unrolling (TRGU)
Implicit Function (IF)
Value Function (VF)
Proofs of Theoretical Results
Proof of Lemma 3.1
...and 17 more sections

Key Result

Lemma 3.1

For any ${\boldsymbol{\phi}} \in \Phi$, if ${\boldsymbol{v}}_i\sim \mathrm{Unif}(\{-1, 1\}^N)$, the batched forward-gradient estimate (eq:forward_gradient_batch) satisfies a bound stated in terms of $\rho := \frac{b}{N-1} \in (0, 1]$, where the sample size $b$ is chosen from $\{1,\cdots, N-1\}$.
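The role of $\rho$ can be illustrated with a short calculation. The following is a sketch under assumptions not stated on this page: the batched estimator is taken to be the standard average of $b$ independent rank-one projections, $g$ denotes the exact meta-gradient at ${\boldsymbol{\phi}}$, and the symbols $\hat{g}$ and $g$ are introduced here for illustration rather than taken from the paper.

```latex
\begin{align*}
\hat{g} &= \frac{1}{b}\sum_{i=1}^{b} (g^\top v_i)\, v_i,
  \qquad v_i \sim \mathrm{Unif}(\{-1,1\}^N) \ \text{i.i.d.} \\
\mathbb{E}\big[v_i v_i^\top\big] &= I_N
  \;\Longrightarrow\; \mathbb{E}[\hat{g}] = g \\
\mathbb{E}\,\big\|(g^\top v_i)\, v_i\big\|^2
  &= N\,\mathbb{E}\big[(g^\top v_i)^2\big] = N\,\|g\|^2
  \qquad (\text{since } \|v_i\|^2 = N) \\
\mathbb{E}\,\|\hat{g} - g\|^2
  &= \frac{1}{b}\big(N\|g\|^2 - \|g\|^2\big)
   = \frac{N-1}{b}\,\|g\|^2
   = \frac{1}{\rho}\,\|g\|^2
\end{align*}
```

Under these assumptions, a larger sample size $b$ pushes $\rho$ toward 1 and shrinks the estimator's mean squared error proportionally, which is consistent with $\rho^{-1}$ appearing in the convergence rate.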

Figures (5)

Figure 1: Top Left: A comparison of bi-level optimization methods. $\text{(FG)}^2$U circumvents the large-scale challenges inherent in classical bi-level optimization techniques. Within large-scale bi-level optimization, $\text{(FG)}^2$U prioritizes the accuracy of gradient approximation over efficiency. Top Right: An overview of the cost-effective two-phase paradigm. $\text{(FG)}^2$U is ideally positioned in Phase II to enhance performance after an approximate solution has been obtained using other efficient methods. Bottom Left: GPU memory usage and performance in the meta-learning online adaptation experiment. $\text{(FG)}^2$U effectively addresses the memory issue of RGU when both the inner model and the unrolled depth are large. Bottom Center: GPU memory usage and performance in the data condensation experiments. $\text{(FG)}^2$U outperforms other large-scale bi-level optimization methods, owing to its accurate gradient approximation, while also using less memory. Bottom Right: Efficiency tradeoff of $\text{(FG)}^2$U in the data condensation experiments. The efficiency of $\text{(FG)}^2$U can be improved via intra/inter-GPU parallelism.
Figure 2: Left: Comparison of efficiency between the PINN solver and the numerical solver. We evaluated Adam (Kingma & Ba, 2014) and SGD as the inner optimizers for the PINN solver, with steps ranging from 100 to 50,000. The results demonstrate that the numerical solver is significantly more efficient. Right: Comparison of relative L2 errors in the prediction of ${\boldsymbol{\phi}}$ and $u$. $\boldsymbol{\epsilon}_{\boldsymbol{\phi}} = \|{\boldsymbol{\phi}}_{pred} - {\boldsymbol{\phi}}\|_2 / \|{\boldsymbol{\phi}}\|_2$, $\boldsymbol{\epsilon}_u = \|u_{pred} - u\|_2 / \|u\|_2$.
Figure B.1: CIFAR100, IPC=50: Inner Loss and gradient norm for Neumann
Figure E.1: Visualization of the 2D latent solutions for the Burgers, Allen-Cahn, and KdV equations. The observed data are sampled on an $8 \times 8$ grid, denoted by white points.

Theorems & Definitions (11)

Lemma 3.1
Theorem 3.4: Convergence
Remark 3.5
Lemma 3.1
proof
Lemma B.2
proof
Lemma B.3
proof
Theorem 3.4: Convergence
...and 1 more
