Modalities, a PyTorch-native Framework For Large-scale LLM Training and Research

Max Lübbering, Timm Ruland, Richard Rutmann, Felix Stollenwerk, David Fitzek, Michael Fromm, Alexander Weber, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Mehdi Ali

TL;DR

Modalities tackles the high cost and fragility of large-scale LLM ablations by delivering a PyTorch-native, end-to-end framework that unifies data-driven research with production-grade pretraining. It achieves this through a declarative YAML configuration and a registry-factory object graph that assembles a full training stack, validated and fed to a generic SPMD driver, with 93 pluggable components across 32 interfaces. The approach enables scalable training via FSDP-based, multi-parallelism pipelines, a high-throughput data pipeline, and seamless Hugging Face integration including checkpoint conversion. The design emphasizes reproducibility and extensibility, allowing rapid hypothesis testing at trillion-token scales without forking or code changes, and supports deployment on 1000+ GPUs.

Abstract

Today's LLM (pre-)training and research workflows typically allocate a significant amount of compute to large-scale ablation studies. Despite the substantial compute cost of these ablations, existing open-source frameworks provide limited tooling for them, often forcing researchers to write their own wrappers and scripts. We propose Modalities, an end-to-end PyTorch-native framework that integrates data-driven LLM research with large-scale model training from two angles. First, by integrating state-of-the-art parallelization strategies, it enables both efficient pretraining and systematic ablations at trillion-token and billion-parameter scale. Second, Modalities adopts a modular design with declarative, self-contained configuration, enabling levels of reproducibility and extensibility that are difficult to achieve out of the box with existing LLM training frameworks.
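
To make the idea of a declarative configuration resolved through a registry and factories more concrete, the following is a minimal sketch of the general pattern in plain Python. It is not Modalities' actual API: the schema keys (`component_key`, `variant_key`, `config`, `ref`), the `register` and `build` helpers, and the toy components are all assumptions chosen for illustration. The dictionary stands in for what a YAML loader would return.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Registry mapping (interface, variant) -> factory. All names are hypothetical.
REGISTRY: Dict[Tuple[str, str], Callable[..., Any]] = {}


def register(component_key: str, variant_key: str):
    """Decorator that records a factory under an (interface, variant) pair."""
    def decorator(factory):
        REGISTRY[(component_key, variant_key)] = factory
        return factory
    return decorator


@register("model", "tiny_mlp")
@dataclass
class TinyMLP:
    hidden_dim: int


@register("optimizer", "sgd")
@dataclass
class SGD:
    model: Any
    lr: float


@register("trainer", "simple")
@dataclass
class Trainer:
    model: Any
    optimizer: Any


# Stand-in for a parsed YAML file; {"ref": name} points at another top-level entry.
CONFIG = {
    "model": {"component_key": "model", "variant_key": "tiny_mlp",
              "config": {"hidden_dim": 64}},
    "optimizer": {"component_key": "optimizer", "variant_key": "sgd",
                  "config": {"lr": 0.1, "model": {"ref": "model"}}},
    "trainer": {"component_key": "trainer", "variant_key": "simple",
                "config": {"model": {"ref": "model"},
                           "optimizer": {"ref": "optimizer"}}},
}


def build(config: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve the config into an object graph; each entry is built exactly once.

    Note: this sketch does not detect cyclic references.
    """
    built: Dict[str, Any] = {}

    def resolve(value: Any) -> Any:
        if isinstance(value, dict) and set(value) == {"ref"}:
            return build_component(value["ref"])
        return value

    def build_component(name: str) -> Any:
        if name not in built:
            spec = config[name]
            factory = REGISTRY[(spec["component_key"], spec["variant_key"])]
            kwargs = {k: resolve(v) for k, v in spec["config"].items()}
            built[name] = factory(**kwargs)
        return built[name]

    for name in config:
        build_component(name)
    return built


graph = build(CONFIG)
```

In this sketch, swapping the optimizer for an ablation means changing `variant_key` in the configuration rather than editing training code, and because references are resolved to shared instances, `graph["trainer"].model` is the same object as `graph["model"]`.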

Paper Structure (6 sections, 2 figures)

This paper contains 6 sections, 2 figures.

Figures (2)

  • Figure 1: High-level Modalities architecture. A self-contained YAML configuration defines an interface-level dependency graph, which a registry-factory mechanism resolves into an object graph. The resulting object graph is validated and injected into a generic SPMD training driver for distributed training and evaluation.
  • Figure 2: 8B Llama 3 benchmarking on the Leonardo supercomputer (Turisini et al., 2023). The left and center plots show equal convergence behavior on 100B FineWeb tokens (Penedo et al., 2024) and strong scaling behavior up to 1024 ranks. The standalone NCCL benchmark on the right shows latency and saturation behavior for different message sizes and numbers of ranks on Leonardo.
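
As a reading aid for the strong scaling result in Figure 2, the sketch below shows one common way to quantify strong scaling from measured throughput: the observed speedup relative to a baseline run, divided by the ideal speedup given by the ratio of rank counts. The function name and the numbers in the usage line are hypothetical and are not taken from the paper; the sketch also assumes that both runs use the same global batch size, so that the total work per step is fixed.

```python
def strong_scaling_efficiency(
    base_ranks: int,
    base_throughput: float,
    ranks: int,
    throughput: float,
) -> float:
    """Return observed speedup divided by ideal speedup (1.0 means perfect scaling).

    Throughput can be in any consistent unit, e.g. tokens per second.
    """
    if base_ranks <= 0 or ranks <= 0:
        raise ValueError("rank counts must be positive")
    if base_throughput <= 0 or throughput <= 0:
        raise ValueError("throughputs must be positive")
    observed_speedup = throughput / base_throughput
    ideal_speedup = ranks / base_ranks
    return observed_speedup / ideal_speedup


# Hypothetical measurements: doubling the ranks yields 1.8x the throughput.
efficiency = strong_scaling_efficiency(8, 1000.0, 16, 1800.0)
```

With these made-up inputs, the observed speedup is 1.8 against an ideal of 2.0, so the efficiency is about 0.9. Values that fall as the rank count grows typically indicate that communication is taking up a larger share of each step.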