A Saddle Point Remedy: Power of Variable Elimination in Non-convex Optimization

Min Gan, Guang-Yong Chen, Yang Yi, Lin Yang

TL;DR

This work provides a principled geometric explanation for the effectiveness of variable elimination in non-convex optimization. By analyzing the relationship between original and reduced landscapes through Hessian inertia and the Schur complement, it shows that variable elimination preserves minima while converting certain saddles into maxima, thereby simplifying the energy landscape. The rank-1 and rank-$r$ matrix factorization examples, including a Grassmannian formulation, illustrate how saddle points are reshaped in the reduced space, and numerical experiments on shallow networks and deep ResNets validate faster convergence and more robust solutions. The results offer a general design principle for robust optimization algorithms in machine learning by actively transforming troublesome saddles into easily escapable maxima.

Abstract

The proliferation of saddle points, rather than poor local minima, is increasingly understood to be a primary obstacle in large-scale non-convex optimization for machine learning. Variable elimination algorithms, like Variable Projection (VarPro), have long been observed to exhibit superior convergence and robustness in practice, yet a principled understanding of why they so effectively navigate these complex energy landscapes has remained elusive. In this work, we provide a rigorous geometric explanation by comparing the optimization landscapes of the original and reduced formulations. Through a rigorous analysis based on Hessian inertia and the Schur complement, we prove that variable elimination fundamentally reshapes the critical point structure of the objective function, revealing that local maxima in the reduced landscape are created from, and correspond directly to, saddle points in the original formulation. Our findings are illustrated on the canonical problem of non-convex matrix factorization, visualized directly on two-parameter neural networks, and finally validated in training deep Residual Networks, where our approach yields dramatic improvements in stability and convergence to superior minima. This work goes beyond explaining an existing method; it establishes landscape simplification via saddle point transformation as a powerful principle that can guide the design of a new generation of more robust and efficient optimization algorithms.

Paper Structure

This paper contains 18 sections, 8 theorems, 76 equations, 6 figures.

Key Result

Theorem 1

Let $F: \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ be a $C^1$ function. Assume that for each $\boldsymbol{x} \in \mathbb{R}^p$, the inner problem $\min_{\boldsymbol{y} \in \mathbb{R}^q} F(\boldsymbol{x},\boldsymbol{y})$ has a unique solution $\boldsymbol{y}^{*}(\boldsymbol{x})$, and that $\bol

Figures (6)

  • Figure 1: Optimization paths of gradient descent applied to the original Rosenbrock function and to the reduced function, both started from the same initial point (-1.5, 2.25). The blue path represents the conventional approach, which oscillates and twists at the beginning and then creeps along the flat valley. The red path represents the variable elimination approach, which reaches the minimum in only a few steps.
  • Figure 2: Variable elimination turns a saddle point into a local maximum. Left: the original function $f(x,y)=\frac{1}{3}x^3 + y^2 + 2xy - 6x - 3y + 4$, which has a saddle point and a local minimum. Right: the reduced function $\varphi(x) = \frac{1}{3}x^3 - x^2 - 3x + \frac{7}{4}$, which has a local maximum and a local minimum. The variable elimination approach searches along the black line in the original space and is reflected in the left figure by the blue line.
  • Figure 3: The 3D landscape of the reduced objective function $f(\boldsymbol{x})$ for the matrix $\boldsymbol{M} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$. The function's value depends only on the direction of $\boldsymbol{x}$, not its magnitude. The valley along the $x_1$-axis corresponds to the global minima, while the ridge along the $x_2$-axis corresponds to the global maxima.
  • Figure 4: Visualization of the original cost function and the reduced cost function for the MLP (top row) and RBF (bottom row) models. Left: The 3D surface of the original cost function. Middle: The contour plot of the original cost function, where the red curve represents the manifold of optimal linear parameters. Right: The corresponding one-dimensional reduced cost function. Arrows explicitly illustrate the mapping of stationary points.
  • Figure 5: A box plot comparing the distribution of the final residual sum of squares (RSS) for the joint Levenberg-Marquardt (Joint LM) algorithm and the Variable Projection (VarPro) algorithm after 100 training runs from random initializations. The VarPro method demonstrates significantly lower median error and superior consistency.
  • ...and 1 more figure
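
The two-variable example in the Figure 2 caption can be checked by hand, and the sketch below does so in plain Python. It assumes that $y$ is the eliminated variable, so that $y^*(x) = \frac{3}{2} - x$ follows from $\partial f / \partial y = 2y + 2x - 3 = 0$. The function names (`y_star`, `classify_original`, `classify_reduced`) and the hand-computed derivatives are illustrative choices, not taken from the paper.

```python
import math

def f(x, y):
    # Original function from the Figure 2 caption.
    return x**3 / 3 + y**2 + 2 * x * y - 6 * x - 3 * y + 4

def grad_f(x, y):
    # Hand-computed gradient of f.
    return (x**2 + 2 * y - 6, 2 * y + 2 * x - 3)

def y_star(x):
    # Assumed inner solution: df/dy = 2y + 2x - 3 = 0.
    return 1.5 - x

def phi(x):
    # Reduced function from the Figure 2 caption.
    return x**3 / 3 - x**2 - 3 * x + 7 / 4

def classify_original(x):
    # The Hessian of f is [[2x, 2], [2, 2]] and does not depend on y.
    det, trace = 4 * x - 4, 2 * x + 2
    if det < 0:
        return "saddle"
    if det > 0:
        return "local minimum" if trace > 0 else "local maximum"
    return "degenerate"

def classify_reduced(x):
    # Second derivative of phi.
    curvature = 2 * x - 2
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "degenerate"

# Critical points of phi solve phi'(x) = x^2 - 2x - 3 = 0.
disc = math.sqrt((-2) ** 2 - 4 * 1 * (-3))
critical_x = sorted([(2 - disc) / 2, (2 + disc) / 2])

report = [(x, y_star(x), classify_original(x), classify_reduced(x))
          for x in critical_x]
for row in report:
    print(row)
```

Under these assumptions, the critical point at $x = -1$ is a saddle of $f$ and a local maximum of $\varphi$, while the one at $x = 3$ is a local minimum of both, which matches the caption.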

Theorems & Definitions (13)

  • Theorem 1: Critical Points Preservation (Reduced $\Rightarrow$ Original)
  • Theorem 2: Global Minimizers Preservation (Reduced $\Rightarrow$ Original)
  • Remark 1
  • Theorem 3: Critical Points Preservation (Original $\Rightarrow$ Reduced)
  • Remark 2
  • Proposition 1
  • Proposition 2: Schur Complement Condition for Definiteness wiki:schur
  • Theorem 4: Preservation of Minima
  • Remark 3
  • Theorem 5: Transformation of Saddle Points to Local Maxima
  • ...and 3 more
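
The theorem statements themselves are not reproduced in this listing, so the sketch below only illustrates the standard linear-algebra fact that the titles of Proposition 2 and Theorems 4 and 5 point to: for a symmetric block Hessian with an invertible eliminated block, the inertia of the full matrix is the inertia of that block plus the inertia of its Schur complement (Haynsworth inertia additivity). It is restricted to the scalar case $p = q = 1$ and reuses the Hessian $\begin{pmatrix} 2x & 2 \\ 2 & 2 \end{pmatrix}$ of the Figure 2 example; all function names and the tolerance `TOL` are assumptions made for illustration.

```python
import math

TOL = 1e-12  # assumed threshold for treating a value as zero

def inertia_scalar(v):
    # (number of positive, negative, zero eigenvalues) of a 1x1 matrix.
    return (int(v > TOL), int(v < -TOL), int(abs(v) <= TOL))

def inertia_2x2(a, b, c):
    # Inertia of the symmetric matrix [[a, b], [b, c]] from its eigenvalues.
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    counts = [0, 0, 0]
    for eig in (mean - radius, mean + radius):
        for i, flag in enumerate(inertia_scalar(eig)):
            counts[i] += flag
    return tuple(counts)

def schur_complement(a, b, c):
    # Schur complement of the eliminated block c (here F_yy): a - b * c^{-1} * b.
    return a - b * b / c

def inertia_via_schur(a, b, c):
    # Inertia additivity: In(H) = In(c) + In(H / c), valid when c != 0.
    block = inertia_scalar(c)
    reduced = inertia_scalar(schur_complement(a, b, c))
    return tuple(u + v for u, v in zip(block, reduced))

# Hessians of the Figure 2 example at its two critical points x = -1 and x = 3.
for x in (-1.0, 3.0):
    a, b, c = 2 * x, 2.0, 2.0
    print(x, schur_complement(a, b, c), inertia_2x2(a, b, c), inertia_via_schur(a, b, c))
```

In this example the Schur complement equals $\varphi''(x) = 2x - 2$. With a positive eliminated block, a negative Schur complement means the original point has one positive and one negative eigenvalue (a saddle) while the reduced function has negative curvature (a local maximum); a positive Schur complement gives a minimum in both.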