Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

Joel Lidin; Amir Sarfi; Erfan Miahi; Quentin Anthony; Shivam Chauhan; Evangelos Pappas; Benjamin Thérien; Eugene Belilovsky; Samuel Dare

TL;DR

This report describes Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run. The run demonstrates that fully democratized, non-whitelisted participation is not only feasible but achievable at unprecedented scale for globally distributed pre-training.

Abstract

Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democratize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.

Paper Structure (39 sections, 2 equations, 6 figures, 4 tables)

Introduction
Background and Methodology
SparseLoCo
Gauntlet
Communication Protocol and Systems
Hardware and parallelism.
Communication over commodity internet.
Bittensor blockchain.
Pre-Training
Setup
Model.
Data and preprocessing.
Optimization Hyperparameters & Pseudo-gradient Compression.
Inner learning rate schedule.
Main Pre-Training Results
...and 24 more sections

Figures (6)

Figure 1: Covenant-72B parallelism protocol. Each peer runs a SparseLoCo replica and communicates heavily compressed and 2-bit-quantized pseudo-gradients with other peers. Within each peer, $8\times$B200 GPUs use dynamic FSDP to shard model parameters, gradients, and training states across local GPUs. During the computation phase (inner steps), GPU $i$ requires only the inner optimizer state shards InnerOpt State$_i$ while the error-feedback EF State$_i$ is offloaded. During the communication phase, InnerOpt State$_i$ is offloaded and swapped with EF State$_i$ to compute compressed pseudo-gradients and update the error-feedback buffer.
Figure 2: Learning rate schedule. Left: pre-training inner learning rate with linear warmup, cosine decay with a flatten window, followed by an annealing phase on higher-quality data. The cosine decay was flattened due to lower participation, which required a longer decay horizon. Right: Supervised fine-tuning schedule with a 4k-context cosine stage followed by an 8k-context cosine-then-linear stage.
Figure 3: Compute--communication timelines over a two-hour window. Each row shows the breakdown of successive training rounds, with black segments denoting the compute window (inner-step training) and red segments denoting synchronization overhead. Despite training a $7.2\times$ larger model, Covenant-72B incurs only 70 s of idle time per round, compared to the 8.3 min per-round synchronization overhead reported for DiLoCo-style training in INTELLECT-1.
Figure 4: Contributing peers over the course of training. The solid curve shows the number of peers whose pseudo-gradients were selected (by Gauntlet) and included in each round's aggregation. We cap the number of contributors at 20; averaged over the run, 16.9 peers contributed per round.
Figure 5: Cumulative unique peer participants over training. At least 70 unique peers contributed to model updates over the course of the run.
...and 1 more figure
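
The Figure 1 caption mentions compressed, 2-bit-quantized pseudo-gradients together with an error-feedback buffer. The sketch below illustrates how those pieces can fit together in one communication round. It is a minimal illustration under stated assumptions, not the authors' implementation: the top-k sparsifier, the uniform four-level quantizer, and the function names `compress` and `decompress` are all hypothetical choices made here for clarity.

```python
from typing import List, Tuple


def compress(
    pseudo_grad: List[float], ef_buffer: List[float], k: int
) -> Tuple[List[int], List[int], float, List[float]]:
    """One round of sparsify + 2-bit quantize with error feedback (illustrative)."""
    # Add back the residual that earlier rounds failed to transmit.
    acc = [g + e for g, e in zip(pseudo_grad, ef_buffer)]
    # Assumed sparsifier: keep the k largest-magnitude entries.
    by_magnitude = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)
    idx = sorted(by_magnitude[:k])
    # Assumed quantizer: 4 uniform levels in [-scale, +scale], i.e. 2 bits per value.
    scale = max((abs(acc[i]) for i in idx), default=0.0)
    levels = [scale * (2 * c - 3) / 3 for c in range(4)]
    codes = [min(range(4), key=lambda c: abs(acc[i] - levels[c])) for i in idx]
    # Error feedback: whatever was not transmitted is kept for the next round.
    new_ef = list(acc)
    for i, c in zip(idx, codes):
        new_ef[i] = acc[i] - levels[c]
    return idx, codes, scale, new_ef


def decompress(idx: List[int], codes: List[int], scale: float, n: int) -> List[float]:
    """Rebuild the dense vector a receiving peer would see."""
    dense = [0.0] * n
    for i, c in zip(idx, codes):
        dense[i] = scale * (2 * c - 3) / 3
    return dense


if __name__ == "__main__":
    grad = [0.9, -0.1, 0.05, -1.2]
    idx, codes, scale, ef = compress(grad, [0.0] * len(grad), k=2)
    print(idx, codes, scale)
    print(decompress(idx, codes, scale, len(grad)))
    print(ef)
```

In this sketch the transmitted vector plus the updated error-feedback buffer always sums back to the accumulated pseudo-gradient, so information dropped by sparsification or quantization is deferred to later rounds rather than lost.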
