Worldwide Federated Training of Language Models

Alex Iacob; Lorenzo Sani; Bill Marino; Preslav Aleksandrov; William F. Shen; Nicholas Donald Lane

TL;DR

This work tackles the challenge of pre-training language models globally under diverse governance and data-privacy regimes by proposing WorldLM, a federations-of-federations architecture. WorldLM partitions each model into a shared backbone $\mathcal{B}$ and personalized key layers $\mathcal{K}$, and uses attention-based aggregation and residual embeddings to reconcile statistical heterogeneity while preserving the autonomy of sub-federations. Empirical results on The Pile and mC4 show that WorldLM outperforms standard federated learning by up to $1.91\times$ and approaches the performance of centralized and fully local models, while remaining robust under differential privacy. The approach offers a practical pathway for privacy-preserving, governance-aware LM pre-training across jurisdictions, enabling broader participation and potentially democratizing access to powerful models.

Abstract

The reliance of language model training on massive amounts of computation and vast datasets scraped from potentially low-quality, copyrighted, or sensitive data has come into question practically, legally, and ethically. Federated learning provides a plausible alternative by enabling previously untapped data to be voluntarily gathered from collaborating organizations. However, when scaled globally, federated learning requires collaboration across heterogeneous legal, security, and privacy regimes while accounting for the inherent locality of language data; this further exacerbates the established challenge of federated statistical heterogeneity. We propose a Worldwide Federated Language Model Training (WorldLM) system based on federations of federations, where each federation has the autonomy to account for factors such as its industry, operating jurisdiction, or competitive environment. WorldLM enables such autonomy in the presence of statistical heterogeneity via partial model localization by allowing sub-federations to attentively aggregate key layers from their constituents. Furthermore, it can adaptively share information across federations via residual layer embeddings. Evaluations of language modeling on naturally heterogeneous datasets show that WorldLM outperforms standard federations by up to $1.91\times$, approaches the personalized performance of fully local models, and maintains these advantages under privacy-enhancing techniques.
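To make the idea of partial model localization more concrete, the following is a minimal sketch of what one aggregation step in a sub-federation might look like. It is not the paper's implementation: the excerpt does not specify the aggregation rule, so the sketch assumes a size-weighted average for the shared backbone and scaled dot-product attention over flattened key-layer parameters, with the parent's own key layer acting as the query. The names `average_backbone`, `attend_keys`, and `aggregate`, and the dictionary layout of a model, are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_backbone(children, sizes):
    """Size-weighted mean of the children's backbone parameters (assumed rule)."""
    total = sum(sizes)
    dim = len(children[0]["backbone"])
    return [
        sum(c["backbone"][i] * n for c, n in zip(children, sizes)) / total
        for i in range(dim)
    ]


def attend_keys(query, children):
    """Scaled dot-product attention (assumed rule): the parent's key layer is
    the query; each child's key layer serves as both key and value."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, c["key"])) / scale for c in children
    ]
    weights = softmax(scores)
    merged = [
        sum(w * c["key"][i] for w, c in zip(weights, children))
        for i in range(len(query))
    ]
    return merged, weights


def aggregate(parent, children, sizes):
    """One hypothetical aggregation step for a sub-federation."""
    key, weights = attend_keys(parent["key"], children)
    return {"backbone": average_backbone(children, sizes), "key": key}, weights


# Toy example: two children, one similar to the parent and one dissimilar.
parent = {"backbone": [0.0, 0.0], "key": [1.0, 0.0]}
children = [
    {"backbone": [1.0, 2.0], "key": [2.0, 0.0]},   # key aligned with the parent
    {"backbone": [3.0, 4.0], "key": [-2.0, 0.0]},  # key opposed to the parent
]
updated, weights = aggregate(parent, children, sizes=[1, 3])
```

In this toy setup the backbone is pulled toward the larger child, while the key layer is dominated by the child whose key layer resembles the parent's. This illustrates how attentive aggregation could keep personalized layers from being averaged toward a dissimilar constituent.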

Paper Structure (13 sections, 9 figures, 4 tables, 1 algorithm)

Introduction
Background
Global Federated Systems
Related Work
WorldLM
Partially-personalized Aggregation
Cross-Federation Information Sharing
Experimental Design
Tasks
Evaluation
Conclusion
Appendix
The Legal Context of LLM Training

Figures (9)

Figure 1: WorldLM federations exchange information in the form of models containing a backbone, personalized layers ($\mathcal{B,K,V}$), and lower-dimensional residual embeddings serving as keys and values ($\mathcal{K,V}$). While full models are exchanged between parents and children, residuals are dynamically routed to the most appropriate sub-federation to be used in attention-based aggregation.
Figure 2: Data perspective on a hierarchical dataset constructed from The Pile. The LHS contains two naturally heterogeneous and quantity-skewed groupings of data sources, corresponding to organizations accessing data from the internet or the medical domain. We construct these groupings using the internet-based Common Crawl (CC) and Wikipedia (WK) versus the medical data of PubMed Abstracts (PBA) and PubMed Central (PBC). To test the effectiveness of WorldLM when such a cluster relationship is absent, we swap the positions of the two smaller datasets.
Figure 3: WorldLM training (a) and validation (b) local performance of the $250$M multilingual model trained on a three-level heterogeneous partitioning of mC4 constructed analogously to Figure 2 and composed of the high-resource Italian and French languages on one side with the lower-resource Ukrainian and Bulgarian on the other. While standard FL stops improving after round $15$, WorldLM reaches a performance close to the local and centralized models. The spike in local model perplexity at the end is due to eventual overtraining on the small Bulgarian and Ukrainian datasets.
Figure 4: WorldLM training and validation performance of the $75$M (a, b) and $125$M (c, d) English models on a three-level heterogeneous partitioning of The Pile (Figure 2). While the hierarchical approach makes steady progress due to its attention-based aggregation and partial personalization, standard FL struggles to converge due to data heterogeneity. Crucially, the performance of WorldLM approaches that of the centralized model and partially overlaps with overfitted local models.
Figure 5: The impact of swapping WK with PBA for The Pile dataset (see Figure 2) for our $75$M model. The swap results in the root node having worse performance (rounds $6$-$7$ and $9$-$10$) due to being unable to reconcile the conflicting update directions from its sub-federations. Despite this fact, the personalized $\mathcal{K}$ layers of the other nodes adjust the backbone $\mathcal{B}$ to their local distribution. The cross-federation information sharing also permits the parameters of similar nodes (e.g., PBA and PBC) to jointly optimize their keys, preserving overall local test-set performance.
...and 4 more figures
