Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Pihe Hu, Shaolong Li, Longbo Huang

TL;DR

The paper tackles the prohibitive computational cost of pretraining transformers by identifying algorithmic FLOP redundancy and introducing Mixed Sparsity Training (MST). MST unifies Dynamic Sparse Training with Sparsity Variation and Hybrid Sparse Attention, operating through warm-up, ultra-sparsification, and restoration phases to maintain performance while reducing FLOPs. The key innovations are the Mixed-Growing topology evolution, sparsity-aware training, and a strided unfactorized sparse self-attention mechanism, which together yield up to $4\times$ FLOP reduction on GPT-2 without sacrificing accuracy. This work offers a practical, transferable approach that is compatible with existing system accelerations and is applicable to broader transformer pretraining tasks.
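To make the idea of strided sparse self-attention more concrete, here is a minimal NumPy sketch of a causal attention mask with a stride. The exact pattern is an assumption for illustration (a local window of the most recent `stride` positions plus every earlier position at a distance that is a multiple of `stride`, merged into a single mask); it is not taken from the paper, and the function name `strided_causal_mask` is hypothetical.

```python
import numpy as np


def strided_causal_mask(seq_len: int, stride: int) -> np.ndarray:
    """Return a boolean mask where entry [i, j] is True if query i may attend to key j.

    Assumed pattern (illustrative, not the paper's definition):
      * causal: a query never attends to later positions;
      * local: the most recent `stride` positions are always visible;
      * strided: earlier positions are visible when their distance from the
        query is a multiple of `stride`.
    """
    i = np.arange(seq_len)[:, None]  # query positions as a column vector
    j = np.arange(seq_len)[None, :]  # key positions as a row vector
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)


mask = strided_causal_mask(seq_len=9, stride=3)

# Fraction of the dense causal attention entries that the sparse mask keeps.
dense_causal_entries = 9 * (9 + 1) // 2
kept_fraction = mask.sum() / dense_causal_entries
```

For longer sequences the kept fraction shrinks, because each query sees roughly `stride + i / stride` keys instead of `i + 1`; this is the source of the attention FLOP savings in patterns of this kind.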

Abstract

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce Floating Point Operations (FLOPs) by about $75\%$ while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 shows a FLOP reduction of $4\times$ without compromising performance.
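The three-phase sparsity variation and the link between a $75\%$ FLOP saving and a $4\times$ reduction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the piecewise-linear shape, the phase fractions, and the peak and final sparsity values are hypothetical placeholders, not the schedule or hyperparameters used in the paper.

```python
def sparsity_at(step, total_steps, warmup_frac=0.1, restore_frac=0.2,
                peak_sparsity=0.9, final_sparsity=0.75):
    """Target weight sparsity (fraction of masked weights) at a training step.

    All defaults are hypothetical. Assumes 0 < warmup_frac, 0 < restore_frac,
    and warmup_frac + restore_frac < 1.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    warmup_end = warmup_frac * total_steps
    restore_start = (1.0 - restore_frac) * total_steps
    if step < warmup_end:
        # Warm-up: move from a dense model (sparsity 0) to the peak sparsity.
        return peak_sparsity * step / warmup_end
    if step < restore_start:
        # Ultra-sparsification: train at the highest sparsity.
        return peak_sparsity
    # Restoration: reinstate connections, lowering sparsity to its final value.
    progress = (step - restore_start) / (total_steps - restore_start)
    return peak_sparsity + (final_sparsity - peak_sparsity) * progress


def flop_reduction_factor(fraction_saved):
    """Convert a fractional FLOP saving into a multiplicative reduction factor."""
    return 1.0 / (1.0 - fraction_saved)


# Saving 75% of the FLOPs leaves 25% of the original cost, i.e. a 4x reduction.
reduction = flop_reduction_factor(0.75)
schedule = [sparsity_at(s, 1000) for s in (0, 50, 500, 900, 1000)]
```

Note that weight sparsity and total FLOP savings are different quantities: total training FLOPs also depend on the attention cost and on how long each phase lasts, so the final sparsity value above should not be read as the FLOP saving.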

Paper Structure

This paper contains 44 sections, 11 equations, 18 figures, 10 tables, 1 algorithm.

Figures (18)

  • Figure 1: Pretraining FLOPs of GPT-2, detailed in Appendix \ref{app:flops}.
  • Figure 2: The sparsity variation (SV) of MST comprises three phases: warm-up, ultra-sparsification, and restoration. SV is combined with Mixed-Growing (MG) based dynamic sparse training and HSA throughout training.
  • Figure 3: Sparsity variation.
  • Figure 4: Piecewise cosine annealing.
  • Figure 5: Strided attention with stride length $l=3$.
  • ...and 13 more figures

Theorems & Definitions (2)

  • Remark 3.1
  • Remark 3.2