Natural Hypergradient Descent: Algorithm Design, Convergence Analysis, and Parallel Implementation

Deyi Kong; Zaiwei Chen; Shuzhong Zhang; Shancong Mou

TL;DR

NHGD (Natural Hypergradient Descent) addresses bilevel optimization by replacing the Hessian inverse in the hypergradient with the inverse of the empirical Fisher information matrix (EFIM), which is updated in parallel with the inner SGD iterations. This natural-gradient-style preconditioning yields a parallel optimize-and-approximate scheme that retains high-probability convergence guarantees while reducing runtime overhead. The theory gives high-probability error bounds for the Hessian-inverse approximation and a finite-sample convergence rate matching that of state-of-the-art methods. Experiments on hyper-data cleaning, data distillation, and PDE-constrained optimization show practical scalability and robustness. The approach enables efficient, scalable bilevel learning and suggests avenues for acceleration with blockwise/K-FAC-like structures and extensions to broader inner-objective families.

Abstract

In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation, namely the need to compute or approximate the Hessian inverse, we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.
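The idea of updating an inverse-matrix estimate alongside the inner optimization can be illustrated with a minimal sketch. It assumes the inverse is maintained through Sherman-Morrison rank-one updates on an unnormalized, damped sum of gradient outer products; the function names, the `damping` parameter, and the synthetic gradients are hypothetical and are not taken from the paper, whose exact update rule may differ.

```python
import numpy as np

def sherman_morrison_update(A, g):
    """Given symmetric A = M^{-1}, return (M + g g^T)^{-1} without a full inversion."""
    Ag = A @ g
    return A - np.outer(Ag, Ag) / (1.0 + g @ Ag)

def running_efim_inverse(grads, damping=1.0):
    """Inverse of (damping * I + sum_t g_t g_t^T), built one gradient at a time.

    Assumption: an unnormalized, damped empirical Fisher matrix; dividing the
    sum by the number of gradients would give the averaged version.
    """
    d = grads.shape[1]
    A = np.eye(d) / damping  # inverse of the damping term alone
    for g in grads:          # one rank-one update per stochastic gradient
        A = sherman_morrison_update(A, g)
    return A

# Synthetic stand-ins for per-sample inner gradients (hypothetical data).
rng = np.random.default_rng(0)
grads = rng.normal(size=(50, 4))

A = running_efim_inverse(grads, damping=1.0)
direct = np.linalg.inv(np.eye(4) + grads.T @ grads)
max_err = np.max(np.abs(A - direct))  # discrepancy against direct inversion
```

Each update costs O(d^2) for a d-dimensional gradient, compared with O(d^3) for a fresh inversion, and it only consumes gradients that the inner SGD loop already computes.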

Paper Structure (44 sections, 12 theorems, 141 equations, 7 figures, 5 tables, 2 algorithms)

Introduction
Contributions.
Related Works
Hypergradient Descent
Other Hessian Inverse Approximation Methods
Natural Gradient Descent
Natural Hypergradient Descent Algorithm
SGD algorithm for inner problem:
Iterative approximation of the Hessian inverse:
Iterative approximation of the cross-partial derivative:
Practical Considerations
Convergence Analysis
Analysis for Hessian Inverse Approximation Bound
Hessian Inverse Approximation Bound
Proof Sketch of Theorem \ref{theorem:hessian_inv_approx}
...and 29 more sections

Key Result

Theorem 4.7

Suppose Assumptions as:l_inner, as:unbias_grad, as:bound_grad_inner and as:inner_dist hold. For any outer iteration $k$, given the outer variable $v_k$, set the inner stepsize to $\eta_t = 4/(\mu (t+\frac{8L^2}{\mu^2}))$. Then, for any $\delta \in (0,1)$ and $T\geq T_0(\delta)$, the following holds with [...], where $T_0(\delta)$ is defined in eq:condition_T0.
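The stepsize schedule in the theorem can be evaluated directly. The sketch below assumes that $\mu$ and $L$ are positive constants of the inner problem (the excerpt does not define them); the function name and the example values are hypothetical.

```python
def inner_stepsize(t, mu, L):
    """eta_t = 4 / (mu * (t + 8 L^2 / mu^2)), the schedule stated in the theorem."""
    return 4.0 / (mu * (t + 8.0 * L**2 / mu**2))

# Example values chosen only for illustration.
etas = [inner_stepsize(t, mu=1.0, L=2.0) for t in range(5)]
```

At $t = 0$ the schedule reduces to $\eta_0 = \mu/(2L^2)$, and it decays on the order of $1/t$ afterwards.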

Figures (7)

Figure 1: Overview of NHGD. The inner problem is solved using SGD on Device 1, while gradient information is sent to Device 2 for iterative rank-one updates of the EFIM inverse $A_t$ and the cross-derivatives $L_t$. After the inner loop, Device 1 sends gradient information to Device 2, which approximates the hypergradient and updates the outer variable. The updated $v_{k+1}$ is then returned to Device 1 to resume the inner optimization. This design enables synchronous hypergradient estimation alongside inner-loop optimization.
Figure 2: Test accuracy for the Hyper-data Cleaning task.
Figure 3: Test accuracy for the Data Distillation task.
Figure 4: Outer loss for the PDE-Constrained Optimization task.
Figure 5: Outer loss for the Hyper-data Cleaning task.
...and 2 more figures

Theorems & Definitions (28)

Remark 3.1
Remark 4.6
Theorem 4.7
Lemma 4.8
Remark 4.9
Theorem 4.10
Remark 4.11
Proposition 4.12
Lemma B.1
proof
...and 18 more
