PETRA: Parallel End-to-end Training with Reversible Architectures
Stéphane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon
TL;DR
PETRA tackles memory and synchronization bottlenecks in training deep networks by leveraging reversible architectures and a parallelizable backward pass with approximate input inversion, enabling model-parallel execution with minimal buffering. The method decouples forward and backward computations, exchanges only local activations/gradients, and demonstrates near-linear speedups with the number of stages $J$ and accumulation factor $k$, while maintaining competitive accuracy on CIFAR-10 and ImageNet using RevNets. An open-source PyTorch-based autograd implementation accompanies the approach, highlighting practical applicability. These results position PETRA as a viable path toward scalable training of very large reversible models, with potential extensions to invertible transformers and other domains.
Abstract
Reversible architectures have been shown to perform on par with their non-reversible counterparts and have been applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., sets of layers) to compute independently on different devices, while only needing to communicate activations and gradients with each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, PETRA also removes the need for weight stashing. We develop a custom autograd-like training framework for PETRA and demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.
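To make the notion of a reversible stage concrete, the following is a minimal sketch of the standard RevNet-style additive coupling, in which the input of a block is recomputed from its output instead of being stored. It is not PETRA's implementation: the residual functions `F` and `G`, the helper names `forward` and `inverse`, and the sample values are all hypothetical, and the sketch covers only exact inversion with fixed functions, whereas PETRA relies on an approximate inversion.

```python
import math

# Hypothetical residual functions. They do not need to be invertible
# themselves; the coupling structure is what makes the block reversible.
def F(v):
    return [math.tanh(0.5 * a) for a in v]

def G(v):
    return [math.sin(a) for a in v]

def forward(x1, x2):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs from the outputs without storing them."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2

# Assumed sample input, split into two halves as the coupling requires.
x1, x2 = [0.3, -1.2], [2.0, 0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Reconstruction error is at the level of floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_err)
```

Because the input can be recomputed from the output, a stage built from such blocks does not need to buffer its input activations between the forward and backward passes, which is the property the abstract builds on.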
