Natural Hypergradient Descent: Algorithm Design, Convergence Analysis, and Parallel Implementation

Deyi Kong, Zaiwei Chen, Shuzhong Zhang, Shancong Mou

TL;DR

NHGD tackles bilevel optimization by replacing the Hessian inverse in the hypergradient with the inverse of the empirical Fisher information matrix (EFIM), which is updated in parallel with inner SGD. This natural-gradient-style preconditioning yields a parallel optimize-and-approximate scheme that preserves high-probability convergence guarantees while reducing runtime overhead. The theory provides high-probability bounds on the Hessian-inverse approximation error and a finite-sample convergence rate matching state-of-the-art methods. Experiments on hyper-data cleaning, data distillation, and PDE-constrained optimization demonstrate practical scalability and robustness. The approach suggests avenues for further acceleration with blockwise/K-FAC-like structures and extensions to broader families of inner objectives.

Abstract

In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation, namely the need to compute or approximate the Hessian inverse, we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.
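
To make the bottleneck concrete, the standard implicit-differentiation form of the hypergradient can be written out. The notation here is an assumption for illustration and is not taken from the paper: $f$ is the outer objective, $g$ the inner objective, $v$ the outer variable, $\theta$ the inner variable, and $\theta^*(v) = \arg\min_\theta g(v,\theta)$. With $\Phi(v) = f(v, \theta^*(v))$,

$$\nabla \Phi(v) = \nabla_v f(v,\theta^*(v)) - \nabla^2_{v\theta} g(v,\theta^*(v)) \left[\nabla^2_{\theta\theta} g(v,\theta^*(v))\right]^{-1} \nabla_\theta f(v,\theta^*(v)).$$

A sketch of the surrogate described in the abstract, assuming the inner objective is an average of per-sample losses $\ell(\theta;\xi)$ and writing $\hat{F}_T$ for an empirical Fisher matrix built from $T$ stochastic inner gradients (the paper's exact weighting and regularization may differ):

$$\hat{F}_T = \frac{1}{T}\sum_{t=1}^{T} \nabla_\theta \ell(\theta_t;\xi_t)\, \nabla_\theta \ell(\theta_t;\xi_t)^\top, \qquad \widehat{\nabla \Phi}(v) = \nabla_v f - \nabla^2_{v\theta} g \; \hat{F}_T^{-1}\, \nabla_\theta f.$$

The only change between the two expressions is that $\left[\nabla^2_{\theta\theta} g\right]^{-1}$ is replaced by $\hat{F}_T^{-1}$, which is built from gradients the inner SGD loop already computes.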

Paper Structure

This paper contains 44 sections, 12 theorems, 141 equations, 7 figures, 5 tables, 2 algorithms.

Key Result

Theorem 4.7

Suppose Assumptions as:l_inner, as:unbias_grad, as:bound_grad_inner, and as:inner_dist hold. For any outer iteration $k$, given the outer variable $v_k$, set the inner stepsize to $\eta_t = 4/(\mu (t+\frac{8L^2}{\mu^2}))$. Then for any $\delta \in (0,1)$ and $T\geq T_0(\delta)$, the following holds with ..., where $T_0(\delta)$ is defined in eq:condition_T0.
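
The stepsize schedule in the theorem can be evaluated directly. The snippet below is a minimal sketch, assuming $\mu$ and $L$ are positive constants of the inner problem (presumably the strong-convexity and smoothness constants, which this excerpt does not define); the function name `inner_stepsize` is hypothetical.

```python
def inner_stepsize(t, mu, L):
    """Evaluate eta_t = 4 / (mu * (t + 8 L^2 / mu^2)) from Theorem 4.7.

    mu and L are assumed to be positive constants of the inner problem;
    their exact definitions are not given in this excerpt.
    """
    if mu <= 0 or L <= 0:
        raise ValueError("mu and L must be positive")
    return 4.0 / (mu * (t + 8.0 * L**2 / mu**2))


# The schedule starts at eta_0 = mu / (2 L^2) and decays like 4 / (mu * t).
schedule = [inner_stepsize(t, mu=1.0, L=1.0) for t in range(5)]
```

The offset $8L^2/\mu^2$ caps the initial step at $\mu/(2L^2)$, so the first iterations do not take the large steps that a plain $4/(\mu t)$ schedule would.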

Figures (7)

  • Figure 1: Overview of NHGD. The inner problem is solved using SGD on Device 1, while gradient information is sent to Device 2 for iterative rank-one updates of the EFIM inverse $A_t$ and the cross-derivatives $L_t$. After the inner loop, Device 1 sends gradient information to Device 2, which approximates the hypergradient and updates the outer variable. The updated $v_{k+1}$ is then returned to Device 1 to resume the inner optimization. This design enables synchronous hypergradient estimation alongside inner-loop optimization.
  • Figure 2: Test accuracy for the Hyper-data Cleaning task.
  • Figure 3: Test accuracy for the Data Distillation task.
  • Figure 4: Outer loss for the PDE-Constrained Optimization task.
  • Figure 5: Outer loss for the Hyper-data Cleaning task.
  • ...and 2 more figures
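
The caption of Figure 1 mentions iterative rank-one updates of the EFIM inverse $A_t$. One common way to maintain such an inverse is the Sherman-Morrison identity, sketched below in pure Python. This illustrates the general technique only: the paper's exact recursion (weighting, damping, and the companion update for $L_t$) is not shown in this excerpt, and the function name `sherman_morrison_update` and the identity initialization are assumptions.

```python
def sherman_morrison_update(A, g):
    """Return (F + g g^T)^{-1} given A = F^{-1}, for symmetric A.

    Uses the Sherman-Morrison identity
        (F + g g^T)^{-1} = A - (A g)(A g)^T / (1 + g^T A g),
    which costs O(n^2) per gradient instead of an O(n^3) re-inversion.
    """
    n = len(g)
    Ag = [sum(A[i][j] * g[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(g[i] * Ag[i] for i in range(n))
    return [[A[i][j] - Ag[i] * Ag[j] / denom for j in range(n)]
            for i in range(n)]


# Start from the inverse of an identity (damping) matrix, an assumed
# initialization, then fold in one stochastic gradient at a time.
A = [[1.0, 0.0], [0.0, 1.0]]
for grad in ([1.0, 2.0], [0.5, -1.0]):
    A = sherman_morrison_update(A, grad)
```

Because each update needs only the current gradient, it can run on a second device while the first continues with SGD, which is the arrangement Figure 1 describes.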

Theorems & Definitions (28)

  • Remark 3.1
  • Remark 4.6
  • Theorem 4.7
  • Lemma 4.8
  • Remark 4.9
  • Theorem 4.10
  • Remark 4.11
  • Proposition 4.12
  • Lemma B.1
  • proof
  • ...and 18 more