Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method

Samuel Bright-Thonney, Thomas R. Harvey, Andre Lukas, Jesse Thaler

Abstract

We introduce Sven (Singular Value dEsceNt), a new optimization algorithm for neural networks that exploits the natural decomposition of loss functions into a sum over individual data points, rather than reducing the full loss to a single scalar before computing a parameter update. Sven treats each data point's residual as a separate condition to be satisfied simultaneously, using the Moore-Penrose pseudoinverse of the loss Jacobian to find the minimum-norm parameter update that best satisfies all conditions at once. In practice, this pseudoinverse is approximated via a truncated singular value decomposition, retaining only the $k$ most significant directions and incurring a computational overhead of only a factor of $k$ relative to stochastic gradient descent. By contrast, traditional natural gradient methods scale as the square of the number of parameters. We show that Sven can be understood as a natural gradient method generalized to the over-parametrized regime, recovering natural gradient descent in the under-parametrized limit. On regression tasks, Sven significantly outperforms standard first-order methods including Adam, converging faster and to a lower final loss, while remaining competitive with LBFGS at a fraction of the wall-time cost. We discuss the primary challenge to scaling, namely memory overhead, and propose mitigation strategies. Beyond standard machine learning benchmarks, we anticipate that Sven will find natural application in scientific computing settings where custom loss functions decompose into several conditions.
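
To make the update rule concrete, here is a minimal NumPy sketch of a Sven-style step on a toy over-parametrized linear regression. This is an illustration under stated assumptions, not the paper's implementation: the function name `sven_step`, the unit learning rate, the toy model, and the choice $k = 10$ are ours.

```python
import numpy as np

def sven_step(residuals, jacobian, k, lr=1.0):
    """One Sven-style update: the parameter step that best satisfies all
    per-sample residual conditions at once, computed with a rank-k
    truncated SVD as an approximate Moore-Penrose pseudoinverse of the
    loss Jacobian. (Illustrative sketch, not the paper's implementation.)

    residuals: (n,) vector of per-data-point conditions r_i
    jacobian:  (n, p) matrix J with J[i, j] = d r_i / d theta_j
    k:         number of singular directions to retain (k <= min(n, p))
    """
    U, S, Vt = np.linalg.svd(jacobian, full_matrices=False)  # S is descending
    U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]              # keep top-k directions
    # Truncated pseudoinverse applied to the residuals:
    # delta = -V_k diag(1/s_k) U_k^T r (minimum-norm within the kept subspace)
    delta = -Vt_k.T @ ((U_k.T @ residuals) / S_k)
    return lr * delta

# Toy over-parametrized linear model f(x; theta) = X @ theta,
# with fewer data points (conditions) than parameters.
rng = np.random.default_rng(0)
n, p, k = 20, 50, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
theta = np.zeros(p)

for step in range(3):
    r = X @ theta - y            # each data point contributes one condition
    theta += sven_step(r, X, k)  # for a linear model the Jacobian is just X
    print(step, np.mean((X @ theta - y) ** 2))
# With a fixed Jacobian, the first step removes the residual along the top-k
# directions and later steps stall; for a neural network the Jacobian changes
# every step, so repeated truncated steps keep making progress.
```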

Figures (4)

  • Figure 1: Validation loss as a function of epoch (top) and wall time (bottom) for 1D regression, random polynomial regression, and MNIST classification, comparing Sven against SGD, RMSProp, Adam, and LBFGS. In all tasks, Sven converges faster per epoch and to a lower final loss than all standard first-order methods, remaining competitive with LBFGS despite significantly lower wall-time cost.
  • Figure 2: Training loss vs. epoch for MNIST with varying $k$ (top) and varying rtol (bottom right); an illustrative rtol truncation rule is sketched after this list. The bottom left panel shows the evolution of the singular value spectrum across epochs for all three tasks, illustrating how the rank structure of the loss Jacobian changes during training.
  • Figure 3: Validation loss curves for the toy 1D (left) and polynomial (right) datasets using the micro-batching (top) and parameter-batching (bottom) approaches for reducing computational cost. Each plot shows the mean trajectory with $\pm 1\sigma$ error bands computed over ten runs with different model initializations.
  • Figure 4: Left, center: training and validation loss trajectories comparing Sven to baseline optimizers on MNIST classification with the cross-entropy loss. Right: Singular value spectra at uniformly sampled training batches comparing the label regression (blue) and cross-entropy (orange) objectives.
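
The rtol sweep in Figure 2 refers to truncating the SVD by a relative tolerance on the singular values instead of a fixed $k$. The sketch below shows one plausible such rule; the helper name `rank_from_rtol` and the specific criterion (keep $s_i \geq \mathrm{rtol} \cdot s_{\max}$) are our assumptions, not necessarily the paper's exact rule.

```python
import numpy as np

def rank_from_rtol(singular_values, rtol):
    """Number of singular directions retained when truncating by a relative
    tolerance: keep every s_i with s_i >= rtol * s_max.
    (Illustrative criterion; the paper's exact rule may differ.)"""
    s_max = singular_values[0]  # np.linalg.svd returns values in descending order
    return int(np.sum(singular_values >= rtol * s_max))

S = np.array([10.0, 3.0, 0.5, 1e-3, 1e-8])
print(rank_from_rtol(S, rtol=1e-2))  # -> 3: directions below 0.1 are dropped
```

Under a rule like this, the effective rank $k$ adapts to the singular value spectrum of each batch's Jacobian rather than being fixed in advance, which is consistent with Figure 2's observation that the rank structure evolves over training.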