DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Conglong Li; Zhewei Yao; Xiaoxia Wu; Minjia Zhang; Connor Holmes; Cheng Li; Yuxiong He

TL;DR

The paper tackles rising training costs by focusing on data efficiency for foundation-model pretraining. It proposes DeepSpeed Data Efficiency, a framework that combines two techniques: scalable curriculum-learning-based data sampling, and random layerwise token dropping (random-LTD) for data routing. Across GPT-3 1.3B, GPT-3 MoE, and BERT-large pretraining, the approach yields substantial cost savings while preserving or improving model quality. It also improves GPT-2 and ViT finetuning, which extends the results to computer vision tasks. The framework is designed to be easy to use and is open-sourced within DeepSpeed to enable broader adoption.

Abstract

Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost (\$3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost (\$46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
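
The cost figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes, as the abstract states, that training cost is proportional to the amount of data consumed; the function and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract for GPT-3 1.3B pretraining.
BASELINE_COST_USD = 46_300   # baseline with full data ($46.3K)
DATA_REDUCTION = 12.5        # "12.5x less data/time/cost"


def scaled_cost(baseline_cost: float, reduction: float) -> float:
    """Cost after reducing data by `reduction`, assuming cost is linear in data."""
    return baseline_cost / reduction


reduced_cost = scaled_cost(BASELINE_COST_USD, DATA_REDUCTION)
data_fraction = 1 / DATA_REDUCTION

print(f"{reduced_cost:.0f} USD, {data_fraction:.0%} of the data")
# prints "3704 USD, 8% of the data"
```

The result, about \$3.7K at 8% of the data, matches the figure reported in the abstract.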

Paper Structure (18 sections, 6 figures, 15 tables)


Introduction
Background and Related Works
Design
Efficient data sampling via curriculum learning
Efficient data routing via random-LTD
Composing CL and random-LTD, tuning strategy, usage guidelines
Evaluation
GPT-3 and GPT-3 MoE pretraining
BERT-large pretraining
GPT-2 and ViT finetuning
Conclusion
Appendix
GPT-3 pretraining experimental setup and detailed results
BERT-large pretraining experimental setup and results
GPT-2 finetuning experimental setup
...and 3 more sections

Figures (6)

Figure 1: Model scale (number of parameters) and data scale (number of consumed training tokens) of representative language models in the last 5 years (BERT, Megatron, GPT-3, BLOOM, PaLM).
Figure 2: GPT-3 1.3B pretraining: relative model quality (baseline with full data as 100% quality) under different data consumption (1% to 100%) and training cost (when renting on Azure).
Figure 3: Design of the DeepSpeed Data Efficiency framework.
Figure 4: Transformer layers for baseline and random-LTD. The dashed box is repeated $l-2$ times.
Figure 5: Validation perplexity during GPT-3 1.3B pretraining, comparing the baseline and the best DeepSpeed Data Efficiency solution under 100% and 50% training data.
...and 1 more figure
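
The Figure 4 caption suggests that, of $l$ transformer layers, $l-2$ form a repeated block while the remaining two are treated separately. A minimal sketch of that idea follows, assuming that the first and last layers see every token, that each middle layer processes only a random subset, and that skipped tokens bypass the layer unchanged. The function `random_ltd_forward`, the `keep` parameter, and the toy layer are hypothetical names for illustration, not the DeepSpeed API, and the real technique operates on tensors inside a transformer rather than on Python lists.

```python
import random
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def random_ltd_forward(
    tokens: Sequence[float],
    layers: Sequence[Layer],
    keep: int,
    rng: random.Random,
) -> List[float]:
    """Apply `layers` in order; middle layers only process `keep` random positions."""
    hidden = list(tokens)
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        # Assumption: first and last layers always process the full sequence.
        if i == 0 or i == last or keep >= len(hidden):
            hidden = layer(hidden)
            continue
        # Pick a fresh random subset per layer, kept in original order.
        kept = sorted(rng.sample(range(len(hidden)), keep))
        processed = layer([hidden[p] for p in kept])
        # Write processed tokens back; dropped tokens pass through unchanged.
        for position, value in zip(kept, processed):
            hidden[position] = value
    return hidden


def add_one(xs: List[float]) -> List[float]:
    """Toy stand-in for a transformer layer: counts how often a token is processed."""
    return [x + 1.0 for x in xs]


out = random_ltd_forward([0.0] * 8, [add_one] * 4, keep=4, rng=random.Random(0))
print(out)  # each entry is the number of layers that processed that token
```

With 4 layers, 8 tokens, and `keep=4`, every token is processed by the 2 outer layers and the 2 middle layers each process 4 tokens, so total layer-token work falls from 32 to 24 in this toy setting.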
