Moonwalk: Inverse-Forward Differentiation

Dmitrii Krylov, Armin Karamzade, Roy Fox

TL;DR

Moonwalk tackles the memory bottleneck of Backprop in invertible networks by enabling true gradients through forward-mode differentiation. It introduces a two-phase approach that first computes an input gradient $h_0$ and then uses a vector-inverse-Jacobian product to obtain layer-wise gradients, with Pure-Forward and Mixed-Mode variants offering different time-memory trade-offs. Theoretical analysis shows Moonwalk dramatically reduces memory and, in many cases, time, approaching Backprop-like speed when combined with reverse-mode in Mixed-Mode. Empirical results on a CIFAR-10 RevNet demonstrate substantial memory savings and large speedups (e.g., up to 27× faster for 6 layers and 110× faster for 60 layers) while maintaining gradient fidelity and numerical stability. Overall, Moonwalk provides a practical, scalable path to exact gradient computation in invertible networks with far lower memory footprints than Backprop.
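The two phases can be written as a short chain-rule identity. The notation here is an assumption introduced for illustration (only $h_0$ appears in the summary above): layers $x_{i+1} = f_i(x_i, \theta_i)$ for $i = 0, \dots, L-1$, a loss $\mathcal{L}(x_L)$, and row-vector gradients $h_i = \partial \mathcal{L} / \partial x_i$. When each layer Jacobian $\partial x_{i+1} / \partial x_i$ is invertible, the usual backward recurrence can be run in the forward direction:

$$
h_i = h_{i+1}\,\frac{\partial x_{i+1}}{\partial x_i}
\quad\Longrightarrow\quad
h_{i+1} = h_i \left(\frac{\partial x_{i+1}}{\partial x_i}\right)^{-1},
\qquad
\frac{\partial \mathcal{L}}{\partial \theta_i} = h_{i+1}\,\frac{\partial x_{i+1}}{\partial \theta_i}.
$$

The middle expression is the vector-inverse-Jacobian product: once $h_0$ is known, every $h_{i+1}$ and every parameter gradient can be obtained in a single input-to-output sweep.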

Abstract

Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation and using significantly less memory.
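A minimal JAX sketch of the idea, under stated assumptions, may make the abstract more concrete. Everything named here is hypothetical and not taken from the paper's code: the toy additive-coupling `layer`, its `layer_inverse`, the quadratic `loss`, and `moonwalk_grads`. Writing $h_i$ for the gradient of the loss with respect to the input of layer $i$, the sketch first obtains $h_0$ with forward-mode only (`jax.jacfwd`), then sweeps forward once, updating $h_{i+1} = h_i J_i^{-1}$. For simplicity the vector-inverse-Jacobian product is evaluated as a `jax.vjp` of the layer's inverse, and the per-layer parameter gradient as a `jax.vjp` local to that layer; the authors' implementation may compute both differently.

```python
import jax
import jax.numpy as jnp


def layer(x, W):
    # Toy additive coupling (RevNet-style): update one half from the other, then swap.
    a, b = jnp.split(x, 2)
    return jnp.concatenate([b, a + jnp.tanh(W @ b)])


def layer_inverse(y, W):
    # Exact inverse of `layer`: recover (a, b) from (b, a + tanh(W b)).
    b, a_new = jnp.split(y, 2)
    return jnp.concatenate([a_new - jnp.tanh(W @ b), b])


def forward(x, params):
    for W in params:
        x = layer(x, W)
    return x


def loss(x0, params):
    # Hypothetical scalar loss on the network output.
    return 0.5 * jnp.sum(forward(x0, params) ** 2)


def moonwalk_grads(x0, params):
    # Phase 1: h_0 = dL/dx_0, computed with forward-mode differentiation only.
    h = jax.jacfwd(loss)(x0, params)

    # Phase 2: a single forward sweep that carries (x_i, h_i) from layer to layer.
    x = x0
    grads = []
    for W in params:
        x_next = layer(x, W)
        # Vector-inverse-Jacobian product: h_{i+1} = h_i J_i^{-1}, where J_i^{-1}
        # is the Jacobian of the layer's inverse evaluated at x_{i+1}.
        _, inv_vjp = jax.vjp(lambda y: layer_inverse(y, W), x_next)
        (h_next,) = inv_vjp(h)
        # Parameter gradient of this layer: h_{i+1} * d x_{i+1} / d W.
        _, param_vjp = jax.vjp(lambda w: layer(x, w), W)
        (g,) = param_vjp(h_next)
        grads.append(g)
        x, h = x_next, h_next
    return grads
```

Calling `moonwalk_grads(x0, params)` with an even-length vector `x0` and a list of square matrices of half that size returns one gradient per layer, which can be compared against `jax.grad(loss, argnums=1)(x0, params)`. Only the current activation and the current $h_i$ are carried between layers, which is where the memory saving described in the abstract would come from in this sketch.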

Paper Structure (19 sections, 10 equations, 6 figures, 1 table, 1 algorithm)

Introduction
Related Work
Background
Notation
Forward-Mode Gradients
Projected Forward-Mode Gradients
Moonwalk
Pure-Forward Moonwalk
Mixed-Mode Moonwalk
Complexity Analysis
Experiments
Experimental Setup
Implementation
Memory reduction
Computation Time
...and 4 more sections

Figures (6)

Figure 1: The computation flow diagram of Moonwalk: (a) obtaining $h_0$ with the Forward gradients; (b) alternative method for computing $h_0$ with Backprop; and (c) computing the parameter gradients in forward-mode given $h_0$.
Figure 2: Maximum allocated memory during training. The input is padded to 32x32x8, which corresponds to allocating memory for 32x32x8 $\cdot$ number of layers $\cdot$ number of blocks parameters.
Figure 3: Maximum allocated memory during training for a larger network. The input is padded to 32x32x18. The total number of parameters is 200k $\cdot$ number of layers per block.
Figure 4: Time comparison for computing one batch of 512 with an input size of 32x32x18 across three blocks, each with a varying number of layers per block.
Figure 5: Train accuracy of three models trained with RevBackprop, Backprop, and Mixed gradient methods for 100 epochs, averaged over 20 runs.
...and 1 more figure
