PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction

Lei Guan; Dongsheng Li; Yongle Chen; Jiye Liang; Wenjian Wang; Xicheng Lu

TL;DR

PipeOptim tackles weight inconsistency and weight staleness in asynchronous 1F1B pipeline training with an optimizer-dependent weight prediction strategy. The core idea is to predict future weights ahead of the forward pass from the current weights, the learning rate, and the optimizer's update rule, so that the forward pass runs on consistent, staleness-free weights while backward propagation uses fresh weights. The method adapts to SGDM, Adam, and AdamW, and maintains at most two weight versions per GPU. Across multiple models and tasks it achieves higher throughput and comparable or better accuracy than competing pipeline model parallelism (PMP) approaches. The experiments indicate that PipeOptim is robust to the choice of optimizer and memory efficient, making it a practical option for high-throughput DNN training on multi-GPU systems.

Abstract

Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and delivers high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness because different mini-batches are trained in an interleaved fashion across GPUs. To address both problems simultaneously, we propose an optimizer-dependent weight prediction strategy (a.k.a. PipeOptim) for asynchronous pipeline training. The key insight is to apply weight prediction in the forward pass so that each mini-batch computes its forward pass with consistent and staleness-free weights. Concretely, we first construct the weight prediction scheme from the update rule of the optimizer used to train the deep neural network. Then, throughout "1F1B" pipelined training, each mini-batch performs weight prediction ahead of its forward pass and uses the predicted weights for that forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and achieves high throughput, and 2) ensures effective parameter learning regardless of the optimizer used. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks: image classification, sentiment analysis, and machine translation. The experimental results demonstrate that PipeOptim outperforms popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim is available at https://github.com/guanleics/PipeOptim.
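To make the idea of optimizer-dependent weight prediction concrete, the following is a minimal sketch, not the authors' implementation. It assumes the prediction extrapolates the current weights $s$ steps along the optimizer's update direction $\Delta \mathbf{W}_t$, i.e. $\hat{\mathbf W}_{t+s} \approx \mathbf{W}_t - lr \cdot s \cdot \Delta \mathbf{W}_t$, and that stage `rank` of a `num_stages`-deep 1F1B pipeline sees `num_stages - rank - 1` weight updates between the forward and backward pass of one mini-batch. All function names here are hypothetical, and weight decay (as in AdamW) is left out.

```python
import math

def sgdm_direction(velocity):
    """SGD with momentum: the update direction is the momentum buffer."""
    return list(velocity)

def adam_direction(m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first moment over root of second moment."""
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    return [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

def version_difference(num_stages, rank):
    """Assumed number of updates between forward and backward at a stage."""
    return num_stages - rank - 1

def predict_weights(weights, direction, lr, s):
    """Extrapolate s steps ahead, assuming the direction stays unchanged."""
    return [w - lr * s * d for w, d in zip(weights, direction)]

# Usage: the first stage (rank 0) of a 4-stage pipeline predicts 3 steps ahead.
s = version_difference(num_stages=4, rank=0)
w_hat = predict_weights([1.0, 2.0], sgdm_direction([0.5, -1.0]), lr=0.1, s=s)
print(s, w_hat)
```

Only the update direction changes with the optimizer; the extrapolation step itself is shared. The last stage has a version difference of zero under this assumption, so its predicted weights equal its current weights.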

Paper Structure (22 sections, 6 equations, 10 figures, 6 tables, 1 algorithm)

Introduction
Challenges of "1F1B" Schedule
The PipeOptim Approach
Weight prediction by PipeOptim
Weight prediction formula
Computation of $s$
Computation of $\Delta {\mathbf W}_{t}$
Comparison with asynchronous PMP approaches
PipeOptim vs. PipeDream & PipeDream-2BW
PipeOptim vs. SpecTrain and XPipe
The PipeOptim Workflow
Experimental Results
Experimental Setup
Accuracy
Throughput
...and 7 more sections

Figures (10)

Figure 1: Timelines of serial execution, GPipe, the naive approach, and PipeDream. In Figures \ref{fig:gpipe}, \ref{fig:naive}, and \ref{fig:pipedream}, the grey dashed arrows represent the pipeline training of the 5th mini-batch (micro-batches 17, 18, 19, and 20 for GPipe). The blue squares on the right side of Figure \ref{fig:pipedream} indicate the weights that must be maintained during the training period of the 5th mini-batch. The blue squares on the right side of Figure \ref{fig:pipeoptim} show the weights maintained for the forward pass of the 5th mini-batch.
Figure 2: Timelines of PipeOptim. The grey dashed arrows represent the pipeline training of the 5th mini-batch; the blue squares on the right side of the figure illustrate the weights maintained for the forward pass of the 5th mini-batch.
Figure 3: Experiment results of Group-1. Learning curves of top-1 accuracy versus epochs.
Figure 4: Experiment results of Group-2. Learning curves of top-1 accuracy versus epochs.
Figure 5: Experiment results of Group-3. Figure \ref{convergence-adam-lstm-top1}: top-1 accuracy versus epochs; Figures \ref{convergence-gnmt8} and \ref{convergence-gnmt16}: BLEU score versus epochs.
...and 5 more figures
