Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

Sahil Tyagi; Feiyi Wang

Abstract

Distributed training increases the number of batches processed per iteration either by scaling out (adding more nodes) or scaling up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the time/cost versus batch-size Pareto curve. The optimal batch-size therefore depends on the underlying model, data, and available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models. It combines parallel-systems modeling with statistical performance prediction to identify the optimal batch-size. Tula predicts training time and cost within 7.5-14% error across multiple models, achieves up to 20x overall speedup, and improves test accuracy by 9% on average over standard large-batch training on various vision tasks, thus mitigating the generalization gap and accelerating training at the same time.
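To make the knee-point idea concrete, the following is a minimal sketch of one common heuristic for locating the knee of a decreasing time-versus-batch-size curve: pick the point farthest from the chord joining the first and last measurements. This is not Tula's method, which the abstract describes as combining parallel-systems modeling with statistical performance prediction. The function name `knee_point` and the sample measurements are hypothetical and purely illustrative.

```python
import math


def knee_point(batch_sizes, epoch_times):
    """Return the batch size at the knee of a time-vs-batch-size curve.

    Heuristic (an assumption, not taken from the paper): normalize both
    axes to [0, 1], with batch size on a log2 scale, then choose the point
    with the largest perpendicular distance from the chord that joins the
    first and last measurements.
    """
    if len(batch_sizes) != len(epoch_times) or len(batch_sizes) < 3:
        raise ValueError("need at least three (batch_size, time) pairs")

    xs = [math.log2(b) for b in batch_sizes]
    x_lo, x_hi = xs[0], xs[-1]
    y_lo, y_hi = min(epoch_times), max(epoch_times)
    if x_hi == x_lo or y_hi == y_lo:
        raise ValueError("curve is degenerate; no knee to find")

    # Normalize so that neither axis dominates the distance computation.
    xn = [(x - x_lo) / (x_hi - x_lo) for x in xs]
    yn = [(y - y_lo) / (y_hi - y_lo) for y in epoch_times]

    # Chord from the first to the last normalized point.
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    chord_len = math.hypot(dx, dy)

    best_index, best_dist = 0, -1.0
    for i, (x, y) in enumerate(zip(xn, yn)):
        # Perpendicular distance from (x, y) to the chord.
        dist = abs(dy * (x - xn[0]) - dx * (y - yn[0])) / chord_len
        if dist > best_dist:
            best_index, best_dist = i, dist
    return batch_sizes[best_index]


if __name__ == "__main__":
    # Synthetic epoch times in seconds, invented for illustration only.
    sizes = [32, 64, 128, 256, 512, 1024]
    times = [100.0, 55.0, 32.0, 25.0, 23.0, 22.0]
    print(knee_point(sizes, times))
```

On the synthetic data above, the time savings flatten out after the third measurement, so the heuristic selects a mid-range batch size instead of the largest one, which mirrors the diminishing-returns argument in the abstract.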

Paper Structure (26 sections, 1 theorem, 29 equations, 11 figures, 1 table, 1 algorithm)

Introduction
Background and Challenges
Parallel Efficiency in Distributed Training
Statistical Efficiency in Distributed Training
Design
Parallel Performance Model
Memory Estimation Model
Performance and Cost Modeling
Configuration Selection Policy
Gradient Sensitivity in Distributed Training
Statistical Performance Model
Static Gradient Scaling
Evaluation
Implementation
...and 11 more sections

Key Result

Lemma 1

Assume that the objective function $f$ is $L$-smooth and that stochastic gradients computed using large batches are unbiased with bounded variance. Let the learning-rate satisfy $\eta > 0$, and let adaptive gradient scaling (AGS) enforce bounded per-parameter scaling factors for every iteration $i$ and model parameter $p$. If the learning-rate satisfies [...], then the iterates generated by AGS satisfy [...]

Figures (11)

Figure 1: Epoch time of ResNet50 on ImageNet, VGG11 on CIFAR100, AlexNet on CalTech101, and MobileNetv3 on CalTech256 across two cluster-sizes.
Figure 2: Test accuracy vs. batch-size. Smaller batches achieve better test accuracy than large batches, illustrating the generalization gap in large-batch training.
Figure 3: A schematic overview of Tula's workflow.
Figure 4: (a) $M_{act} \propto$ batch-size. (b) Linear model to predict $M_{batch}$.
Figure 5: (a) Gradient $\ell_2$-norm in early iterations of ResNet50. (b) Largest Hessian eigenvalue and norm of the gradients over the iterations of VGG11.
...and 6 more figures

Theorems & Definitions (2)

Lemma 1: Convergence in Adaptive Gradient Scaling
proof