Beyond Backpropagation: Optimization with Multi-Tangent Forward Gradients

Katharina Flügel, Daniel Coquelin, Marie Weiel, Charlotte Debus, Achim Streit, Markus Götz

TL;DR

This work addresses the bottlenecks of backpropagation by proposing multi-tangent forward gradients as a BP-free alternative. It introduces an orthogonal-projection method to aggregate directional derivatives from multiple random tangents, yielding more accurate gradient approximations within the span of the tangents. Empirical results show that increasing the number of tangents improves both direction and magnitude of the gradient estimate and enhances optimization across synthetic and real networks, though a gap to the true gradient remains for large-scale models. The findings highlight potential for BP-free training with improved parallelism, while underscoring the need for efficient tangent sampling and robust learning-rate strategies to close the remaining gap.

Abstract

The gradients used to train neural networks are typically computed using backpropagation. While an efficient way to obtain exact gradients, backpropagation is computationally expensive, hinders parallelization, and is biologically implausible. Forward gradients are an approach to approximate the gradients from directional derivatives along random tangents computed by forward-mode automatic differentiation. So far, research has focused on using a single tangent per step. This paper provides an in-depth analysis of multi-tangent forward gradients and introduces an improved approach to combining the forward gradients from multiple tangents based on orthogonal projections. We demonstrate that increasing the number of tangents improves both approximation quality and optimization performance across various tasks.
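The single-tangent forward gradient described above can be sketched with a minimal dual-number implementation of forward-mode automatic differentiation. This is an illustrative sketch, not the authors' code: the `Dual` class, the test objective `f`, and the helper `forward_gradient` are hypothetical names, and the tangent is assumed to be drawn from a standard normal distribution.

```python
import random


class Dual:
    """A value together with its directional derivative (tangent part)."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the tangent part.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def f(x):
    # Hypothetical test objective: f(x) = sum_i x_i^2 + x_0 * x_1
    total = x[0] * x[1]
    for xi in x:
        total = total + xi * xi
    return total


def forward_gradient(f, x, v):
    """Return f(x) and the forward gradient (grad f . v) * v from one forward pass."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.val, [out.tan * vi for vi in v]


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
v = [rng.gauss(0.0, 1.0) for _ in x]  # assumed: standard normal tangent
value, g_v = forward_gradient(f, x, v)
```

No backward pass is needed: the directional derivative is carried alongside the function value, and scaling the tangent by it gives an estimate whose inner product with the true gradient is never negative.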

Paper Structure

This paper contains 15 sections, 2 theorems, 10 equations, 6 figures, 1 table.

Key Result

Lemma 1

Let $V=\{v_1, \dots, v_k\}$ be a set of $k$ linearly independent tangents $v_i\in\mathbb{R}^n$, and let $U=\text{span}(V)\subseteq\mathbb{R}^n$ be the subspace spanned by $V$. Then, for any linear combination $\oplus$, the multi-tangent forward gradient satisfies $g_V\in U$.
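As an illustration of Lemma 1, the sketch below builds one particular linear combination of the tangents: the orthogonal projection of the gradient onto $U$, obtained by solving the Gram system $(V^\top V)c = d$ with $d_i = \nabla f \cdot v_i$. This is standard linear algebra and not necessarily the paper's exact formulation. The gradient, the tangents, and the helper names `solve` and `orthogonal_projection_gradient` are hypothetical, and the directional derivatives are simulated from a known gradient instead of being computed by forward-mode automatic differentiation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve A c = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def orthogonal_projection_gradient(tangents, dir_derivs):
    """Combine tangents so the result is the projection of the gradient onto their span."""
    gram = [[dot(vi, vj) for vj in tangents] for vi in tangents]
    coeffs = solve(gram, dir_derivs)
    n = len(tangents[0])
    return [sum(c * v[j] for c, v in zip(coeffs, tangents)) for j in range(n)]


# Toy setting with n = 3 and k = 2 (all values are illustrative).
grad = [3.0, 0.0, 0.0]                          # stands in for the unknown gradient
tangents = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # linearly independent
dir_derivs = [dot(grad, v) for v in tangents]   # in practice: one forward pass each

g = orthogonal_projection_gradient(tangents, dir_derivs)
residual = [a - b for a, b in zip(grad, g)]
```

Because `g` is a weighted sum of the tangents, it lies in $U$ by construction. The residual `grad - g` is orthogonal to every tangent, which is what makes `g` the closest point to the gradient within $U$.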

Figures (6)

  • Figure 1: The forward gradient $g_v$ for the tangent $v$ is a projection of the gradient $\nabla f$ onto the (1D) subspace spanned by $v$. It is by definition always within $90^\circ$ of $\nabla f$ and thus a descent direction of $f$.
  • Figure 2: Orthogonal projection for $n=3$ and $k=2$. The tangents $v_1$ and $v_2$ span a two-dimensional plane $U$. The gradient $\nabla f$ does not lie within this plane, but its orthogonal projection $P_{U}(\nabla f)$ provides the closest approximation of $\nabla f$ in $U$.
  • Figure 3: The approximation quality of different forward gradient approaches for $n=64$ in terms of cosine similarity and relative vector norm, mean over 1000 seeds. As the cosine similarity of the conical combinations is identical, we use dashed lines to better visualize the overlapping curves.
  • Figure 4: The best value found when minimizing $f:\mathbb{R}^n\to\mathbb{R}$ with different gradient approximations, mean over five random seeds. To improve clarity for Styblinski-Tang, we plot $f(x)/n$ instead, as the global minimum $-39.17n$ scales with $n$, and cut off the initial value (0.0 for all $n$) to zoom in on the relevant data.
  • Figure 5: Cosine similarity (mean) and result on Styblinski-Tang with $n=64$ (mean and confidence interval) for different tangent angles $\alpha$, over 1000 and five seeds, respectively.
  • ...and 1 more figure
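The two quality measures named in the Figure 3 caption, cosine similarity and relative vector norm, can be computed as in the following sketch. It is not the paper's experimental setup: the random gradient, the standard normal tangents, the Gram-Schmidt based projection, and all function names are assumptions chosen for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors):
    """Gram-Schmidt orthonormalization; assumes linearly independent inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def project(grad, basis):
    """Sum of single-tangent forward gradients along an orthonormal basis."""
    out = [0.0] * len(grad)
    for q in basis:
        c = dot(grad, q)  # directional derivative along q
        out = [o + c * qi for o, qi in zip(out, q)]
    return out


def cosine_similarity(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def relative_norm(approx, grad):
    return math.sqrt(dot(approx, approx) / dot(grad, grad))


rng = random.Random(0)
n = 64
grad = [rng.gauss(0.0, 1.0) for _ in range(n)]                      # stand-in gradient
tangents = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
basis = orthonormalize(tangents)  # basis[:k] spans the first k tangents

results = {}
for k in (1, 4, 16, 64):
    approx = project(grad, basis[:k])
    results[k] = (cosine_similarity(approx, grad), relative_norm(approx, grad))
```

For an orthogonal projection the two measures coincide, since both equal $\lVert P_U(\nabla f)\rVert / \lVert\nabla f\rVert$. They cannot decrease as tangents are added, and they reach 1 once the tangents span $\mathbb{R}^n$.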

Theorems & Definitions (10)

  • Definition 1: Forward Gradient
  • proof
  • Definition 2: Multi-Tangent Forward Gradient
  • Lemma 1
  • proof
  • Definition 3
  • Definition 4
  • Definition 5: Forward Orthogonal Gradient
  • Lemma 2
  • proof