The Future of Large Language Model Pre-training is Federated

Lorenzo Sani; Alex Iacob; Zeyu Cao; Bill Marino; Yan Gao; Tomas Paulik; Wanru Zhao; William F. Shen; Preslav Aleksandrov; Xinchi Qiu; Nicholas D. Lane

TL;DR

This work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs and shows that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity.

Abstract

Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of compute and data they can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. This paradigm would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show that the effectiveness of federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources. Thus far, we have used Photon to train LLMs of up to 7B parameters and anticipate larger models being completed in the near future. We also show that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity. Finally, we show that convergence is robust to partial participation, opening the avenue for compute-efficient collaborative training. Photon will help data-rich actors to become the protagonists of LLM pre-training instead of leaving the stage to compute-rich actors alone.
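
To make the notion of federated training with partial participation concrete, the following is a minimal sketch of federated averaging on a toy quadratic loss. It is not Photon's implementation: the function names (`local_sgd`, `federated_round`, `train`), the toy loss, and the unweighted average are all assumptions chosen for illustration.

```python
import random


def local_sgd(weights, target, steps, lr):
    """Run a few local gradient steps on the toy loss 0.5 * ||w - target||^2."""
    w = list(weights)
    for _ in range(steps):
        # The gradient of the toy loss with respect to w is (w - target).
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def federated_round(global_w, client_targets, num_sampled, steps, lr, rng):
    """One round: sample a subset of clients, train locally, average the results."""
    sampled = rng.sample(range(len(client_targets)), num_sampled)
    updates = [local_sgd(global_w, client_targets[k], steps, lr) for k in sampled]
    # Unweighted average (assumption: every client holds the same amount of data).
    return [sum(column) / len(updates) for column in zip(*updates)]


def train(client_targets, rounds=200, num_sampled=2, steps=5, lr=0.1, seed=0):
    """Repeat federated rounds starting from an all-zero global model."""
    rng = random.Random(seed)
    global_w = [0.0] * len(client_targets[0])
    for _ in range(rounds):
        global_w = federated_round(
            global_w, client_targets, num_sampled, steps, lr, rng
        )
    return global_w


# Hypothetical client optima standing in for heterogeneous local data.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
full = train(targets, num_sampled=4)      # every client joins each round
partial = train(targets, num_sampled=2)   # only half the clients join each round
```

In this toy setting, full participation drives the global model to the mean of the client optima, while partial participation keeps it fluctuating around that mean. Whether the same behaviour holds for LLM pre-training is an empirical question that the paper investigates; this sketch does not demonstrate it.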

Paper Structure (51 sections, 16 figures, 6 tables, 1 algorithm)

Introduction
The Landscape of LLM Training
Centralized Distributed Optimization
Data and Model Parallelism
Fully Sharded Data Parallelism
Bottlenecks for generative pre-training of LLMs
Mitigation of LLMs demands
Federated Learning and Local SGD
Federated Fine-tuning and Parameter Efficient Fine-tuning of LLMs
Design Principles for Federated Generative Pre-Training of LLMs
Broad Access to Data and Compute
Limited Communication Requirements
Broad Hardware Inclusivity
Scalable Local Training Pipelines
Photon Design
...and 36 more sections

Figures (16)

Figure 1: A hypothetical representation of the available data silos around the world. While scraping data from the web has taken foundation models quite far, most data remains under private entities' control. These organizations can collaborate in the federated generative pre-training of large language models to exploit their data towards the common goal of training LLMs they control. The collaborative nature of FL and its low communication requirements make this possible with only moderately powerful hardware, eliminating the prohibitive costs of pre-training.
Figure 2: The diagram describes Photon's three principal components (Photon Aggregator, Photon LLM Node, and Photon Data Sources, which may be private or public) and their sub-components. Arrows describe how these elements work together or exchange messages. The Photon Aggregator can communicate with the Photon LLM Nodes only through the Photon Link. The instances responsible for storing the data samples, the Photon Data Sources, can stream only to the Photon LLM Node bound to them.
Figure 3: Comparison between the perplexity of the federated global model evaluated on the centralized validation set, the train perplexities of federated clients (averaged together), and the train and test perplexities for a centralized experiment. These metrics are reported for our $75$M (a), $125$M (b), $350$M (c), and $1.3$B (d) experiments. Crucially, the stability of federated training increases with model size. For example, the centralized model outperforms the federated model at $75$M parameters, while the two perform near-identically at $1.3$B. While federated aggregation initially causes large spikes in client perplexity, these subside as the clients reach a consensus on the model parameters, which happens much more quickly for larger models. Following this transitory phase, aggregation has a regularizing effect on model performance, allowing a better model to be trained than would be possible for a single client. The server validation perplexity is a soft upper bound for the spikes, with the gap between train and validation perplexities decreasing over time.
Figure 4: Perplexity comparison between the global model evaluated on the centralized validation set, the train and test perplexities for a centralized baseline, and the training perplexities of clients (averaged together) for our naturally heterogeneous partition of The Pile using either a $75$M (a) or $125$M (b) model size. Unlike the homogeneous partition, the natural heterogeneity of the underlying datasets makes an initial consensus harder to reach for the federated model, as can be observed from the very high initial client and server perplexities. However, like the IID partition shown in Figure 3, once clients reach consensus, performance becomes comparable to a centralized baseline.
Figure 5: The $l_2$ norms of the output activations of our $75$M (a) and $125$M (b) models trained on a naturally heterogeneous partition of The Pile. The norm of the activations is a well-known indicator of future model divergence [meta_opt]. As can be observed, the activations of the centralized model outpace those of the federated clients right from the start and experience a massive increase towards the end of training. The aggregation procedure keeps the federated clients in check by reducing the norm of the activations from round to round.
...and 11 more figures
