Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework

Nii Osae Osae Dade, Moinul Hossain Rahat

TL;DR

The paper tackles the high time and energy costs of pre-training large language models by introducing Litespark, a framework that jointly applies architectural and algorithmic optimizations to transformer attention and MLP blocks to maximize Model FLOPs Utilization (MFU) while remaining compatible with standard transformer implementations. The approach yields throughput improvements of 2x–6x and energy reductions of 55%–83% across 3B and 30B Llama configurations, with MFU climbing from baselines around 44–45% to as high as ~89% at small scales and remaining advantageous at large scales. The contributions include a detailed experimental setup on a SageMaker H200 cluster with SlimPajama-627B data, two Llama-based configurations, and measurements across distributed training runs, which the authors present as evidence of applicability to post-training and foundation-model contexts. The work promises practical impact by accelerating development cycles, lowering electricity costs, and reducing carbon emissions, while maintaining compatibility and extending potential benefits to inference and multimodal architectures.

Abstract

Training Large Language Models (LLMs) is plagued by long training times and massive energy consumption, with modern models requiring months of computation and gigawatt-hours of electricity. In light of these challenges, we introduce Litespark, a novel pre-training framework that addresses these inefficiencies through targeted optimizations to transformer attention and MLP layers. Our approach combines architectural improvements with algorithmic enhancements to maximize Model FLOPs Utilization (MFU) while maintaining compatibility with standard transformer implementations. Comprehensive benchmarking on 3B and 30B parameter Llama models using the SlimPajama-627B dataset demonstrates substantial performance gains: 2x-6x training throughput improvement and 55%-83% energy consumption reduction across multi-node H200 GPU clusters. These optimizations are model- and hardware-agnostic, enabling broad applicability across transformer architectures and extending to post-training phases including supervised fine-tuning and direct preference optimization.
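
For readers unfamiliar with the metric, MFU is commonly estimated as the ratio of achieved training FLOPs per second to the aggregate peak FLOPs per second of the hardware. The sketch below uses the widely used approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass, which ignores attention-specific FLOPs. The function name, its parameters, and the numbers in the usage line are illustrative assumptions; the abstract does not state how the authors compute MFU, and the values are not taken from the paper.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved FLOPs/s divided by aggregate peak FLOPs/s.

    Assumes the common ~6 * N FLOPs-per-token rule for a dense transformer
    (forward + backward), which omits attention-score FLOPs. This is an
    illustrative approximation, not the paper's stated accounting.
    """
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical inputs: a 3B-parameter model processing 100k tokens/s
# on 8 accelerators, each with an assumed peak of 1e15 FLOPs/s.
mfu = model_flops_utilization(3e9, 100_000, 8, 1e15)
print(f"MFU = {mfu:.1%}")
```

Under this estimate, doubling tokens per second on the same hardware doubles MFU, which is why throughput and MFU improvements are reported together.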

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Pre-training throughput comparison on H200s
  • Figure 2: $\textrm{CO}_2$ emissions comparison for 3B models (left) and 30B models (right) on H200s