Modalities, a PyTorch-native Framework For Large-scale LLM Training and Research
Max Lübbering, Timm Ruland, Richard Rutmann, Felix Stollenwerk, David Fitzek, Michael Fromm, Alexander Weber, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Mehdi Ali
TL;DR
Modalities tackles the high cost and fragility of large-scale LLM ablations with a PyTorch-native, end-to-end framework that unifies data-driven research with production-grade pretraining. A declarative YAML configuration is resolved by a registry and factory into an object graph that represents the full training stack; this graph is validated and passed to a generic single-program, multiple-data (SPMD) driver. The framework ships 93 pluggable components across 32 interfaces. It supports scalable training through FSDP-based pipelines that combine multiple parallelism strategies, a high-throughput data pipeline, and Hugging Face integration, including checkpoint conversion. The design emphasizes reproducibility and extensibility, allowing rapid hypothesis testing at trillion-token scale without forking or code changes, and supports deployment on 1000+ GPUs.
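To make the registry-factory idea concrete, the following is a minimal sketch of how a declarative configuration can be resolved into an object graph. It is not the Modalities implementation: the names REGISTRY, register, build, the three toy components, and the config keys component_key, variant_key, and config are all assumptions chosen for illustration, and a plain Python dict stands in for a parsed YAML file.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry mapping (interface, variant) to a constructor.
REGISTRY: Dict[Tuple[str, str], Callable[..., Any]] = {}


def register(interface: str, variant: str):
    """Class decorator that records a component under (interface, variant)."""
    def decorator(cls):
        REGISTRY[(interface, variant)] = cls
        return cls
    return decorator


# Toy components, for illustration only.
@register("model", "tiny_mlp")
@dataclass
class TinyMLP:
    hidden_dim: int


@register("optimizer", "sgd")
@dataclass
class SGD:
    lr: float


@register("trainer", "simple")
@dataclass
class Trainer:
    model: Any
    optimizer: Any
    steps: int


def build(node: Any) -> Any:
    """Recursively turn a config node into objects (the factory step)."""
    if isinstance(node, dict):
        if "component_key" in node:
            key = (node["component_key"], node["variant_key"])
            if key not in REGISTRY:
                # Minimal validation: reject components that were never registered.
                raise KeyError(f"unknown component: {key}")
            # Build nested components first, then construct this one.
            kwargs = {k: build(v) for k, v in node.get("config", {}).items()}
            return REGISTRY[key](**kwargs)
        return {k: build(v) for k, v in node.items()}
    if isinstance(node, list):
        return [build(v) for v in node]
    return node  # plain scalar (int, float, str, ...)


# Stand-in for a parsed YAML file (assumed layout, not the Modalities schema).
CONFIG = {
    "trainer": {
        "component_key": "trainer",
        "variant_key": "simple",
        "config": {
            "steps": 100,
            "model": {
                "component_key": "model",
                "variant_key": "tiny_mlp",
                "config": {"hidden_dim": 64},
            },
            "optimizer": {
                "component_key": "optimizer",
                "variant_key": "sgd",
                "config": {"lr": 0.1},
            },
        },
    }
}

components = build(CONFIG)
trainer = components["trainer"]
```

In this sketch, swapping the optimizer or model for an ablation only requires changing the variant_key and its config entries; the build function and the component classes stay untouched, which is the property the declarative approach is meant to provide.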
Abstract
Today's LLM (pre-)training and research workflows typically allocate a significant amount of compute to large-scale ablation studies. Despite the substantial compute costs of these ablations, existing open-source frameworks provide limited tooling for them, often forcing researchers to write their own wrappers and scripts. We propose Modalities, an end-to-end PyTorch-native framework that integrates data-driven LLM research with large-scale model training from two angles. First, by integrating state-of-the-art parallelization strategies, it enables both efficient pretraining and systematic ablations at trillion-token and billion-parameter scale. Second, Modalities adopts a modular design with declarative, self-contained configuration, enabling levels of reproducibility and extensibility that are difficult to achieve out of the box with existing LLM training frameworks.
