Photon: Federated LLM Pre-Training

Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, Yan Gao, Dongqi Cai, Zexi Li, Wanru Zhao, Xinchi Qiu, Nicholas D. Lane

TL;DR

This paper introduces Photon, the first complete system for federated end-to-end LLM training. Photon leverages cross-silo FL for global-scale training with minimal communication overheads, and the authors present it as the first economical system for global internet-wide LLM pre-training.

Abstract

Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized training; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512x less. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging's robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.

Paper Structure

This paper contains 33 sections, 9 equations, 11 figures, 8 tables, 3 algorithms.

Figures (11)

  • Figure 1: (a) Systematic diagram of Photon's three principal components: the Photon aggregator, Photon LLM clients, and Photon data sources. Arrows describe interactions and message exchanges. The Photon aggregator can only communicate with the Photon LLM nodes through the Photon link. The instances responsible for storing the data samples, the Photon data sources, can stream only to the Photon client bound to them. (b) Information flows between LLM clients, the aggregator, and data sources. Following initialization (1), an LLM client selected by the client sampler (2) receives model parameters from the server (3), trains on data from a data source (4,5), checkpoints (6), and then returns the updated parameters (7) to the aggregator for federated optimization (8).
  • Figure 2: The locations and bandwidth of participating clients in the federation, with multiple nodes equipped with H100s at each site. More details are available in \ref{tab:regions}. Bandwidth between regions varies significantly, impacting the efficiency of Photon's aggregation procedures. The map shows the RAR topology (gray dashed line) and the PS topology (black solid line). The slowest link in the RAR topology, between Maharashtra and Quebec, acts as a bottleneck. In the PS topology, the connection speed to England limits each update's communication.
  • Figure 3: Comparison of perplexity convergence (lower is better) for Photon and centralized training with $3$B (top) and $7$B (bottom) models. The federated global model was evaluated on the C4 test set, with averaged train perplexities across clients and centralized train/test perplexities presented for both models. These large federated models show lower perplexity than centralized models and remain stable during aggregation, with minimal perplexity spikes after early rounds.
  • Figure 4: Our results show federated models to be comparable to centralized and potentially superior as they obtain lower perplexity (PP) given the same computational resources. Their perplexity gains grow with model size.
  • Figure 5: The tradeoff between wall time and compute resources (the larger the batch size, the more resources) when Photon trains a model to a target perplexity. We measure the impact of the global batch size $B_g = NB_l$, where $N\in\{1,2,4,8,16\}$ (number of clients per round) and $B_l=32$ (local batch size), on the wall time needed to reach two target perplexities: $42$ (top, near the centralized baseline) and $35$ (bottom, near optimum). With fewer local steps per round (64), increasing $B_g$ clearly reduces wall time for both perplexity targets, but with more local work (128 and 512 steps), the returns on reduced wall time diminish.
  • ...and 6 more figures
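
The round structure in the Figure 1(b) caption and the global batch size in the Figure 5 caption can be illustrated with a small toy sketch. This is not Photon's implementation: the function names, the quadratic toy objective standing in for LLM training, and the use of plain parameter averaging as the federated optimizer are all assumptions made for illustration. Only the step numbering (1) to (8) and the relation $B_g = NB_l$ with $B_l = 32$ come from the captions.

```python
import random

LOCAL_BATCH_SIZE = 32  # B_l, taken from the Figure 5 caption


def global_batch_size(num_clients, local_batch_size=LOCAL_BATCH_SIZE):
    """B_g = N * B_l, as defined in the Figure 5 caption."""
    return num_clients * local_batch_size


def local_train(params, target, steps, lr):
    """Hypothetical stand-in for local LLM training.

    Runs `steps` gradient steps on the toy loss 0.5 * ||w - target||^2,
    where `target` plays the role of the client's bound data source.
    """
    w = list(params)
    for _ in range(steps):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def federated_round(global_params, client_targets, clients_per_round, steps, lr, rng):
    """One round following the numbered steps in the Figure 1(b) caption."""
    # (2) the client sampler selects the participants for this round
    sampled = rng.sample(sorted(client_targets), clients_per_round)
    updates = []
    for cid in sampled:
        received = list(global_params)  # (3) client receives the parameters
        trained = local_train(received, client_targets[cid], steps, lr)  # (4,5)
        updates.append(trained)  # (7) return updated parameters; (6) checkpointing omitted
    # (8) federated optimization, assumed here to be plain parameter averaging
    return [sum(ws) / len(ws) for ws in zip(*updates)]


def run(rounds=20, seed=0):
    rng = random.Random(seed)
    # Four hypothetical clients with deliberately different (heterogeneous) targets
    client_targets = {f"client{i}": [float(i), float(-i)] for i in range(4)}
    params = [10.0, 10.0]  # (1) initialization
    for _ in range(rounds):
        params = federated_round(params, client_targets, 4, steps=8, lr=0.1, rng=rng)
    return params


if __name__ == "__main__":
    print("B_g for N=16:", global_batch_size(16))
    print("final parameters:", run())
```

In this toy setting each client pulls the parameters toward its own target, and averaging after every round moves the global parameters toward the mean of the client targets. With $N = 16$ clients per round, `global_batch_size(16)` gives $16 \times 32 = 512$ samples per global step.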