INTELLECT-1 Technical Report

Sami Jaghouar, Jack Min Ong, Manveer Basra, Fares Obeid, Jannik Straube, Michael Keiblinger, Elie Bakouch, Lucas Atkins, Maziyar Panahi, Charles Goddard, Max Ryabinin, Johannes Hagemann

TL;DR

INTELLECT-1 addresses the scalability challenge of training frontier language models by enabling decentralized, community-driven computation through the PRIME framework. PRIME combines ElasticDeviceMesh, DiLoCo, and a hybrid FSDP-DiLoCo approach with int8 gradient quantization to achieve fault-tolerant, bandwidth-constrained training across global GPUs, while preserving convergence. The work demonstrates a full-stack open-source release including the base model, intermediate checkpoints, pretraining and post-training data, and the PRIME framework, validated on a globally distributed 10B-parameter model trained on $10^{12}$ tokens with 83-96% compute utilization. The findings suggest that decentralized, incentive-aligned compute networks can enable scalable, open, frontier AI progress, though challenges remain in bandwidth variability, extreme node churn, and the need for a dedicated collective-communications stack for global internet-scale training.
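
As a rough illustration of the DiLoCo idea named above, the toy sketch below runs several local optimizer steps on each worker and then combines the resulting parameter deltas (pseudo-gradients) in a single outer step. It is a minimal sketch under stated assumptions, not PRIME's implementation: the function name `diloco_round`, the plain-SGD inner and outer optimizers (DiLoCo as published uses AdamW for the inner steps and Nesterov momentum for the outer step), and the scalar quadratic losses are all hypothetical choices made here for brevity.

```python
from typing import Callable, List

Params = List[float]
GradFn = Callable[[Params], Params]


def diloco_round(
    global_params: Params,
    worker_grad_fns: List[GradFn],
    inner_steps: int,
    inner_lr: float,
    outer_lr: float,
) -> Params:
    """One communication round: local training, then one averaged outer step."""
    pseudo_grads = []
    for grad_fn in worker_grad_fns:
        # Each worker starts from the shared parameters and trains locally.
        local = list(global_params)
        for _ in range(inner_steps):
            grads = grad_fn(local)
            local = [p - inner_lr * g for p, g in zip(local, grads)]
        # Pseudo-gradient: how far this worker moved away from the shared point.
        pseudo_grads.append([gp - lp for gp, lp in zip(global_params, local)])

    # The only cross-worker communication: average the pseudo-gradients.
    n = len(pseudo_grads)
    averaged = [sum(column) / n for column in zip(*pseudo_grads)]

    # Outer step (plain SGD here, an assumption of this sketch).
    return [p - outer_lr * d for p, d in zip(global_params, averaged)]


def make_quadratic_grad(target: float) -> GradFn:
    """Gradient of 0.5 * (x - target)^2 for a single scalar parameter."""
    return lambda params: [params[0] - target]


def run_toy_training(rounds: int = 50) -> float:
    """Two workers with different data (targets 1.0 and 3.0) agree on 2.0."""
    params = [0.0]
    workers = [make_quadratic_grad(1.0), make_quadratic_grad(3.0)]
    for _ in range(rounds):
        params = diloco_round(params, workers, inner_steps=10,
                              inner_lr=0.1, outer_lr=1.0)
    return params[0]
```

Workers exchange data once per round rather than once per gradient step, which is where the bandwidth saving of this family of methods comes from.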

Abstract

In this report, we introduce INTELLECT-1, the first 10 billion parameter language model collaboratively trained across the globe, demonstrating that large-scale model training is no longer confined to large corporations but can be achieved through a distributed, community-driven approach. INTELLECT-1 was trained on 1 trillion tokens using up to 14 concurrent nodes distributed across 3 continents, with contributions from 30 independent compute providers dynamically joining and leaving the training process, while maintaining 83-96% compute utilization and 36.2-41.4% model FLOPS utilization. We leverage PRIME, our scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes. Key innovations in PRIME include the ElasticDeviceMesh, which manages dynamic global process groups for fault-tolerant communication across the internet and local process groups for communication within a node, live checkpoint recovery kernels, and a hybrid DiLoCo-FSDP2 implementation. Using PRIME with DiLoCo and our custom int8 all-reduce, we achieve a 400x reduction in communication bandwidth compared to traditional data-parallel training settings while delivering comparable performance. These results demonstrate the feasibility and promise of training frontier foundation models in a decentralized network of global GPU resources.
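
One way to arrive at a reduction of this size is to multiply two factors: synchronizing less often and sending fewer bytes per value. The sketch below works through that arithmetic and pairs it with a simple symmetric int8 quantizer. The specific numbers are assumptions not stated in the text above (a baseline that all-reduces fp32 gradients every step, and one synchronization per 100 inner steps), and the quantizer is an illustrative stand-in, not PRIME's custom all-reduce kernel.

```python
from typing import List, Tuple


def baseline_bytes_per_step(num_params: int, bytes_per_value: int = 4) -> float:
    """Data-parallel baseline: every step sends one fp32 value per parameter."""
    return float(num_params * bytes_per_value)


def diloco_bytes_per_step(
    num_params: int, inner_steps: int = 100, bytes_per_value: int = 1
) -> float:
    """Assumed setup: one int8 value per parameter, sent once per `inner_steps`."""
    return num_params * bytes_per_value / inner_steps


def bandwidth_reduction(num_params: int, inner_steps: int = 100) -> float:
    """Ratio of baseline traffic to the less frequent, quantized traffic."""
    return baseline_bytes_per_step(num_params) / diloco_bytes_per_step(
        num_params, inner_steps
    )


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # 4x from fp32 -> int8, times 100x from syncing every 100 steps.
    print(bandwidth_reduction(10_000_000_000))
```

Under these assumptions the parameter count cancels out of the ratio, so the result depends only on the synchronization interval and the bytes per value.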

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The topology of the ElasticDeviceMesh. Each process in the ElasticDeviceMesh is assigned a local and global rank. The local rank is used by the FSDP process groups, while the global rank is used by an independent fault-tolerant data-parallel process group.
  • Figure 2: Locations of the nodes by all 30 compute contributors for INTELLECT-1. The lines between nodes illustrate the Ring-All-Reduce topology, spanning the whole globe from the US to Europe, Asia, and back to the US.
  • Figure 3: Distribution of all-reduce operation times across different geographical configurations. The variance increases significantly as we move from USA-only to global distribution, indicating less reliable network conditions.
  • Figure 4: Distribution of all-reduce completion times across different geographical setups. The increasing spread and right-skewed nature of the distributions highlight growing network instability as geographical distances increase. Red represents global, green represents USA and Europe, and blue represents USA-only training.
  • Figure 5: Number of active training nodes over training steps. The graph demonstrates PRIME's ability to handle dynamic node participation, starting with 4 nodes and scaling up to 14 nodes, while maintaining training stability despite frequent node fluctuations.
  • ...and 2 more figures
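
The two-level rank layout described in the Figure 1 caption can be sketched as a small data structure. This is a hedged illustration only: the names `Rank`, `build_mesh`, `fsdp_group`, `data_parallel_group`, and `remove_node` are hypothetical, and the choice to renumber global ranks contiguously after a node leaves is an assumption of this sketch, not a documented behavior of the ElasticDeviceMesh.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rank:
    global_rank: int  # position of the node in the cross-internet group
    local_rank: int   # position of the GPU inside its node


def build_mesh(num_nodes: int, gpus_per_node: int) -> List[Rank]:
    """Assign every process a (global, local) rank pair."""
    return [
        Rank(node, gpu)
        for node in range(num_nodes)
        for gpu in range(gpus_per_node)
    ]


def fsdp_group(mesh: List[Rank], global_rank: int) -> List[Rank]:
    """Processes on one node: they shard the model among themselves."""
    return [r for r in mesh if r.global_rank == global_rank]


def data_parallel_group(mesh: List[Rank], local_rank: int) -> List[Rank]:
    """Processes holding the same shard on different nodes."""
    return [r for r in mesh if r.local_rank == local_rank]


def remove_node(mesh: List[Rank], failed_global_rank: int) -> List[Rank]:
    """Drop a departed node and renumber the surviving global ranks."""
    survivors = sorted(
        {r.global_rank for r in mesh if r.global_rank != failed_global_rank}
    )
    remap = {old: new for new, old in enumerate(survivors)}
    return [
        Rank(remap[r.global_rank], r.local_rank)
        for r in mesh
        if r.global_rank != failed_global_rank
    ]
```

Keeping the two ranks separate means a node joining or leaving only changes the global groups, while the groups inside each remaining node are untouched.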