PETRA: Parallel End-to-end Training with Reversible Architectures

Stéphane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon

TL;DR

PETRA tackles memory and synchronization bottlenecks in training deep networks by leveraging reversible architectures and a parallelizable backward pass with approximate input inversion, enabling model-parallel execution with minimal buffering. The method decouples forward and backward computations, exchanges only local activations/gradients, and demonstrates near-linear speedups with the number of stages $J$ and accumulation factor $k$, while maintaining competitive accuracy on CIFAR-10 and ImageNet using RevNets. An open-source PyTorch-based autograd implementation accompanies the approach, highlighting practical applicability. These results position PETRA as a viable path toward scalable training of very large reversible models, with potential extensions to invertible transformers and other domains.

Abstract

Reversible architectures have been shown to perform on par with their non-reversible counterparts and have been applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., sets of layers) to compute independently on different devices, while communicating only activations and gradients with one another. By decoupling the forward and backward passes and keeping a single updated version of the parameters, PETRA also removes the need for weight stashing. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.

Paper Structure (31 sections, 6 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Comparison of PETRA with standard backpropagation. This approach splits the stages of a model and decouples their forward and backward passes, resulting in a sixfold increase in parallelization speed in this example.
  • Figure 2: Differences between the residual block of a ResNet and its reversible counterpart. (a) Forward of a residual block. (b) Forward and (c) reverse forward of a reversible residual block. For reversible blocks, as in Gomez et al. (2017), the input $x_j$ is doubled in size and split equally into $\{x_j^1, x_j^2\}$ along its channels. The function $\mathcal{F}_j$ includes a skip-connection while $\tilde{\mathcal{F}}_j$ does not.
  • Figure 3: Comparison of memory use between PETRA and a standard delayed gradient method (Zhuang et al., 2020). By avoiding weight stashing and reversing the output into the input during the backward phase, we are able to fully decouple the forward and backward phases in all reversible stages, with no memory overhead compared to standard delayed gradient approaches.
  • Figure 4: Validation accuracy of PETRA and backpropagation for various numbers of accumulation steps, for a RevNet18 trained on ImageNet with $k \in \{1, 2, 4, 8, 16, 32\}$. The validation accuracies are averaged over the last 10 epochs. As the number of accumulation steps increases, the effective staleness in PETRA decreases, closing the gap with standard backpropagation.
  • Figure 5: Cosine similarities and norm ratios between gradients throughout training. Each point represents the average of 15 measurements during 1 epoch. Values are smoothed with a rolling window of size 10. Color corresponds to the stage index. The approximation is noticeably better after the last learning rate drop, and for later stages. PETRA approximates the standard delayed gradient well, and it also approximates the end-to-end gradient better than standard delayed gradient approaches do.
  • ...and 1 more figures
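
The channel split and reverse pass described in the Figure 2 caption can be illustrated with a small sketch. This is not PETRA's implementation: it assumes the additive coupling form commonly used for reversible residual blocks in the RevNet literature ($y^1 = x^1 + f(x^2)$, $y^2 = x^2 + g(y^1)$), and the exact arrangement of $\mathcal{F}_j$ and $\tilde{\mathcal{F}}_j$ in the paper may differ. The names `split_channels`, `reversible_forward`, `reversible_inverse`, `toy_f`, and `toy_g` are hypothetical, and plain Python lists stand in for channel tensors.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Branch = Callable[[Vector], Vector]


def split_channels(x: Vector) -> Tuple[Vector, Vector]:
    """Split an even-length channel vector into two equal halves."""
    if len(x) % 2 != 0:
        raise ValueError("expected an even number of channels")
    half = len(x) // 2
    return x[:half], x[half:]


def reversible_forward(x: Vector, f: Branch, g: Branch) -> Vector:
    """Assumed additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    x1, x2 = split_channels(x)
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1 + y2


def reversible_inverse(y: Vector, f: Branch, g: Branch) -> Vector:
    """Recover the input by undoing the two additions in reverse order."""
    y1, y2 = split_channels(y)
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1 + x2


def toy_f(v: Vector) -> Vector:
    # Hypothetical stand-in for a learned sub-network; it need not be invertible.
    return [math.tanh(2.0 * c) for c in v]


def toy_g(v: Vector) -> Vector:
    # Second hypothetical stand-in branch.
    return [math.sin(c) + 0.5 for c in v]


x = [0.3, -1.2, 0.7, 2.0]
y = reversible_forward(x, toy_f, toy_g)
x_rec = reversible_inverse(y, toy_f, toy_g)

# The input is reconstructed from the output up to floating-point rounding,
# so it does not have to be kept in memory for the backward pass.
max_error = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_error)
```

Under this assumed coupling, the branches `toy_f` and `toy_g` are only ever evaluated in the forward direction, which is why arbitrary sub-networks can be used while the block as a whole stays invertible.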