Towards Scalable Backpropagation-Free Gradient Estimation

Daniel Wang, Evan Markou, Dylan Campbell

TL;DR

The paper tackles the memory and scalability limitations of backpropagation by proposing a forward-gradient estimator that reduces both bias and variance through orthogonalising the upstream Jacobian, leveraging the observed low-dimensional structure of neural gradients. The core idea, $\tilde{W}^\perp$, constrains the guessing space to a low-rank subspace, with a Newton–Schulz variant ($\tilde{W}^\perp$-NS) offering faster approximations. Experiments on a three-hidden-layer MLP show that small subspace dimensions (e.g., $k=10$) often yield the best performance, and that scaling to larger widths preserves the advantage over prior methods, while a preconditioning approach that eliminates bias at the cost of variance underperforms. The work demonstrates a viable path toward scalable, memory-efficient gradient estimation that could extend to more complex architectures and inform understanding of gradient subspace structure in deep networks.
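To make the Newton–Schulz idea concrete, the sketch below shows the classical cubic Newton–Schulz iteration for approximately orthogonalising a matrix without an explicit SVD. It is an illustration only: this summary does not state which variant the paper uses, how many iterations it runs, or exactly which matrix it is applied to, so the function names, the Frobenius-norm scaling, and the iteration count are all assumptions.

```python
# Classical cubic Newton-Schulz iteration (illustrative sketch only; the
# paper's own variant and hyperparameters are not specified in this summary).


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalise(a, n_iters=30):
    """Approximate the orthogonal polar factor of a full-rank matrix `a`.

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    where the update X <- 1.5 X - 0.5 X X^T X pushes each one towards 1.
    `n_iters` is a hypothetical default, not a value taken from the paper.
    """
    fro = sum(v * v for row in a for v in row) ** 0.5
    x = [[v / fro for v in row] for row in a]
    for _ in range(n_iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


if __name__ == "__main__":
    w = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]]
    q = newton_schulz_orthogonalise(w)
    # For an orthogonal q, q @ q^T should be close to the identity matrix.
    for row in matmul(q, transpose(q)):
        print([round(v, 6) for v in row])
```

The appeal of this kind of iteration is that it uses only matrix multiplications, which is presumably why a Newton–Schulz variant can serve as a faster approximation to an exact orthogonalisation.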

Abstract

While backpropagation (reverse-mode automatic differentiation) has been extraordinarily successful in deep learning, it requires two passes (forward and backward) through the neural network and the storage of intermediate activations. Existing gradient estimation methods that instead use forward-mode automatic differentiation struggle to scale beyond small networks due to the high variance of the estimates. Efforts to mitigate this have so far introduced significant bias to the estimates, reducing their utility. We introduce a gradient estimation approach that reduces both bias and variance by manipulating upstream Jacobian matrices when computing guess directions. It shows promising results and has the potential to scale to larger networks, indeed performing better as the network width is increased. Our understanding of this method is facilitated by analyses of bias and variance, and their connection to the low-dimensional structure of neural network gradients.

Paper Structure

This paper contains 25 sections, 16 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overlap of the true gradient $\frac{\partial L}{\partial s}$ onto the $k$-dimensional subspace of $\mathrm{Im}(\tilde{W}^\mathsf{T})$ corresponding to the $k$ largest singular values of $W$, for layer widths of 128 and 512. In both cases, $\frac{\partial L}{\partial s}$ predominantly lies in a subspace of much lower dimension than $\mathrm{rank}(W)$; around $k=10$ captures most of the gradient, reducing the guessing space while introducing minimal bias. Note that the Layer 3 weight matrix is low rank since it is connected to the output layer of width 10.
  • Figure 2: Training accuracy as the hidden layer width increases.
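
One plausible way to quantify the overlap plotted in Figure 1 is the fraction of the gradient's norm that survives orthogonal projection onto the subspace. This is an assumption for illustration; the paper's exact definition is not reproduced in this summary. Writing $g = \frac{\partial L}{\partial s}$ and letting $V_k$ be a hypothetical matrix whose $k$ orthonormal columns span the subspace:

$$
\mathrm{overlap}_k(g) \;=\; \frac{\lVert V_k V_k^\mathsf{T} g \rVert_2}{\lVert g \rVert_2} \;=\; \frac{\lVert V_k^\mathsf{T} g \rVert_2}{\lVert g \rVert_2} \;\in\; [0, 1].
$$

Under this definition the overlap equals 1 exactly when $g$ lies in the subspace, and it is non-decreasing in $k$ when the subspaces are nested, which matches the reading of the caption that a small $k$ already captures most of the gradient.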