Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Pihe Hu; Shaolong Li; Longbo Huang

TL;DR

The paper tackles the prohibitive computational cost of pretraining transformers by identifying algorithmic FLOP redundancy and introducing Mixed Sparsity Training (MST). MST unifies Dynamic Sparse Training with Sparsity Variation and Hybrid Sparse Attention, operating through warm-up, ultra-sparsification, and restoration phases to maintain performance while reducing FLOPs. The key innovations are the Mixed-Growing topology evolution, sparsity-aware training, and a strided unfactorized sparse self-attention mechanism, which together yield up to $4\times$ FLOP reduction on GPT-2 without sacrificing accuracy. This work offers a practical, transferable approach that is compatible with existing system accelerations and is applicable to broader transformer pretraining tasks.

Abstract

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations. This motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce floating-point operations (FLOPs) by about $75\%$ while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
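The two headline figures in the abstract, a reduction of about $75\%$ of FLOPs and a $4\times$ FLOP reduction, describe the same quantity: removing a fraction $r$ of the work leaves $1 - r$ of it, so the reduction factor is $1 / (1 - r)$. The short sketch below only illustrates that arithmetic; the helper name `flop_reduction_factor` is hypothetical and not taken from the paper.

```python
def flop_reduction_factor(removed_fraction: float) -> float:
    """Convert a fraction of FLOPs removed into an 'N times fewer FLOPs' factor.

    Hypothetical helper for illustration only; it is not part of the paper.
    """
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("removed_fraction must be in [0, 1)")
    # Remaining work is (1 - r); the reduction factor is its reciprocal.
    return 1.0 / (1.0 - removed_fraction)


# Removing 75% of the FLOPs leaves 25% of the work, i.e. a 4x reduction.
factor = flop_reduction_factor(0.75)
```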

Paper Structure (44 sections, 11 equations, 18 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Dynamic Sparse Training
Transformer Pruning
Structured pruning.
Semi-structured pruning.
Unstructured pruning.
Mixed Sparsity Training
Sparsity Variation
Warm-Up Phase
Ultra-Sparsification Phase
Restoration Phase
Dynamic Sparse Training
Hybrid Sparse Attention
Experiment
...and 29 more sections

Figures (18)

Figure 1: Pretraining FLOPs of GPT-2, detailed in Appendix \ref{app:flops}.
Figure 2: The sparsity variation of MST includes three phases: warm-up, ultra-sparsification and restoration. SV is combined with MG-based dynamic sparse training and HSA during the training.
Figure 3: Sparsity variation.
Figure 4: Piecewise cosine annealing.
Figure 5: Strided attention with stride length $l=3$.
...and 13 more figures
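
The caption of Figure 5 names strided attention with stride length $l=3$, but this listing does not define the pattern. As a rough sketch only, the code below assumes the commonly used causal strided pattern in which each query position attends to its most recent $l$ positions and to every $l$-th earlier position, with both sets merged into a single mask. That definition, the function name `strided_causal_mask`, and its parameters are assumptions for illustration and may differ from the paper's HSA mechanism.

```python
import numpy as np


def strided_causal_mask(seq_len: int, stride: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Assumed pattern (not confirmed by the source): causal attention restricted
    to (a) the most recent `stride` positions and (b) every `stride`-th earlier
    position, merged into one mask.
    """
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    i = np.arange(seq_len)[:, None]   # query positions as a column
    j = np.arange(seq_len)[None, :]   # key positions as a row
    offset = i - j                    # how far key j lies behind query i
    causal = offset >= 0              # never attend to future positions
    local = offset < stride           # the most recent `stride` positions
    strided = (offset % stride) == 0  # every `stride`-th earlier position
    return causal & (local | strided)


# Example with the stride length named in the Figure 5 caption.
mask = strided_causal_mask(seq_len=7, stride=3)
density = mask.sum() / np.tril(np.ones((7, 7), dtype=bool)).sum()
```

Under this assumed pattern, each row keeps roughly `stride + i // stride` entries instead of `i + 1`, which is where the attention FLOP savings would come from on long sequences; `density` above compares the kept entries with full causal attention.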

Theorems & Definitions (2)

Remark 3.1
Remark 3.2
