
Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer

Dharma Teja Vooturi, Dhiraj Kalamkar, Dipankar Das, Bharat Kaul

Abstract

Pretraining Large Language Models (LLMs) from scratch requires massive amounts of compute. The Aurora supercomputer is an exascale machine with 127,488 Intel PVC (Ponte Vecchio) GPU tiles. In this work, we showcase LLM pretraining on Aurora at the scale of thousands of GPU tiles. Towards this effort, we developed Optimus, an in-house training library with support for standard large-model training techniques. Using Optimus, we first pretrained Mula-1B, a 1-billion-parameter dense model, and Mula-7B-A1B, a 7-billion-parameter Mixture of Experts (MoE) model, from scratch on 3,072 GPU tiles for the full 4 trillion tokens of the OLMoE-mix-0924 dataset. We then demonstrated model scaling by pretraining three large MoE models, Mula-20B-A2B, Mula-100B-A7B, and Mula-220B-A10B, to 100 billion tokens on the same dataset. On our largest model, Mula-220B-A10B, we pushed the compute scaling from 384 to 12,288 GPU tiles and observed a scaling efficiency of around 90% at 12,288 GPU tiles. We significantly improved the runtime performance of MoE models using custom GPU kernels for expert computation and a novel EP-Aware sharded optimizer, resulting in training speedups of up to 1.71x. As part of the Optimus library, we also developed a robust set of reliability and fault-tolerance features to improve training stability and continuity at scale.
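For context, scaling from 384 to 12,288 GPU tiles is a 32x increase in compute. The abstract does not state how scaling efficiency is measured, so the following is only the conventional definition relative to the 384-tile baseline, not necessarily the authors' exact formula:

    efficiency(N) = (throughput(N) / throughput(384)) / (N / 384)

Under that reading, roughly 90% efficiency at N = 12,288 tiles means the measured training throughput reaches about 0.9 x 32 = 28.8x the baseline throughput, compared with the ideal 32x.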

Paper Structure

This paper contains 10 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: (Left) Training loss on the 4 trillion token OLMoE-mix-0924 dataset. (Right) Training loss up to 100 billion training tokens on the same dataset.
  • Figure 2: Benchmark performance progression of Mula-1B and Mula-7B-A1B models on the 4 trillion token OLMoE-mix-0924 dataset.
  • Figure 3: Mula-7B-A1B and allenai/OLMoE-1B-7B-0924 benchmark performance progression with evaluations done on intermediate training checkpoints.
  • Figure 4: Compute scaling of Mula-220B-A10B model pretraining from 32 nodes (768 GPU tiles) to 1024 nodes (12288 GPU tiles).
  • Figure 5: Index generation (input_indices and output_indices) example on four input tokens (T=4) with four experts (N=4) and two chosen experts per input token (K=2). In the EP case on the right with two ranks, experts 0 and 1 are placed on rank 0, and experts 2 and 3 are placed on rank 1. (A rough sketch of this index generation appears after this list.)
  • ...and 1 more figure
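Figure 5's index-generation example can be made concrete with a short sketch. The snippet below is an illustrative reconstruction, not the Optimus kernel: it assumes input_indices gathers token rows into expert-sorted order and output_indices records the flat (token, k) slot each expert output scatters back to, matching the caption's T=4, N=4, K=2 setup. Under that assumption, the EP placement in the caption (experts 0-1 on rank 0, experts 2-3 on rank 1) corresponds to a contiguous split of the expert-sorted slots.

    import numpy as np

    def generate_routing_indices(topk_expert_ids, num_experts):
        # topk_expert_ids: (T, K) array; entry [t, k] is the expert chosen
        # for token t in its k-th routing slot.
        T, K = topk_expert_ids.shape
        flat_experts = topk_expert_ids.reshape(-1)        # (T*K,) expert id per slot
        order = np.argsort(flat_experts, kind="stable")   # slots grouped by expert
        input_indices = order // K                        # token row to gather for each grouped slot
        output_indices = order                            # flat (token, k) slot to scatter each expert output to
        tokens_per_expert = np.bincount(flat_experts, minlength=num_experts)
        return input_indices, output_indices, tokens_per_expert

    # Caption's setting: T=4 tokens, N=4 experts, K=2 experts per token
    # (the routing choices themselves are made up for illustration).
    topk = np.array([[0, 2],
                     [1, 3],
                     [0, 1],
                     [2, 3]])
    inp, out, counts = generate_routing_indices(topk, num_experts=4)
    # With expert parallelism over two ranks, rank 0 would process the first
    # counts[0] + counts[1] grouped slots (experts 0 and 1) and rank 1 the
    # remaining slots (experts 2 and 3), under this assumption.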