
Scaling Law Analysis in Federated Learning: How to Select the Optimal Model Size?

Xuanyu Chen, Nan Yang, Shuai Wang, Dong Yuan

TL;DR

This paper studies how distributing training data across federated clients changes the compute-optimal model size predicted by scaling laws for large models. By modeling federated training as SGD over distributed data and applying a PAC-Bayes generalization bound, it derives analytic expressions for the optimal model size $d^*$ in both the federated and the centralized setting, and shows that $d^*$ follows a negative power law in the number of clients $n$ when total compute is fixed. It also shows that, at equal training compute, switching to FL lowers the attainable upper bound on generalization performance relative to centralized training, and that $d^*$ is effectively determined by the average per-client compute rather than the total compute. Experiments with Vision Transformer and ResNet models corroborate the theory, offering practical guidelines for model-size selection in distributed training and highlighting the trade-offs between data decentralization, compute, and generalization.

Abstract

The recent success of large language models (LLMs) has sparked growing interest in training large-scale models. As model sizes continue to scale, concerns are growing about the depletion of high-quality, well-curated training data. This has led practitioners to explore training approaches such as Federated Learning (FL), which can leverage the abundant data on edge devices while maintaining privacy. However, the decentralization of training datasets in FL introduces challenges to scaling large models, a topic that remains under-explored. This paper fills this gap and provides qualitative insights on generalizing previous model-scaling experience to federated learning scenarios. Specifically, we derive a PAC-Bayes (Probably Approximately Correct Bayesian) upper bound on the generalization error of models trained with stochastic algorithms in federated settings, and we quantify the impact of distributed training data on the optimal model size by finding the analytic solution for the model size that minimizes this bound. Our theoretical results demonstrate that the optimal model size has a negative power-law relationship with the number of clients when the total training compute is held fixed. We also find that switching to FL with the same training compute inevitably reduces the upper bound on the generalization performance that the model can achieve through training, and that estimates of the optimal model size in federated scenarios should depend on the average training compute across clients. Finally, we empirically validate our results with extensive training runs on different models, network settings, and datasets.
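The qualitative claim can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the optimal model size is a power law in the average per-client compute, $d^* = a\,(C/n)^{b}$. The function name `optimal_model_size`, the coefficient `coeff` ($a$), and the exponent `exponent` ($b$) are hypothetical placeholders, not values or formulas taken from the paper.

```python
import math


def optimal_model_size(total_compute: float, num_clients: int,
                       coeff: float = 1.0, exponent: float = 0.5) -> float:
    """Toy power law d* = coeff * (C / n) ** exponent.

    ASSUMPTION: both the functional form and the default constants are
    illustrative placeholders; the paper derives its own expression.
    """
    if total_compute <= 0:
        raise ValueError("total_compute must be positive")
    if num_clients < 1:
        raise ValueError("num_clients must be at least 1")
    # d* depends only on the average compute available per client.
    avg_compute = total_compute / num_clients
    return coeff * math.pow(avg_compute, exponent)


if __name__ == "__main__":
    total = 1e18  # fixed total training compute (arbitrary units)
    for n in (1, 4, 16, 64):
        # With total compute fixed, d* shrinks as n ** (-exponent).
        print(f"n={n:3d}  d*={optimal_model_size(total, n):.3e}")
```

Under this assumed form, two configurations with the same average per-client compute (for example, total compute $2C$ over 2 clients and $C$ on a single machine) yield the same $d^*$, which mirrors the statement that the estimate should depend on average rather than total compute.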


Paper Structure

This paper contains 23 sections, 23 theorems, 119 equations, 3 figures, 2 tables.

Key Result

Lemma 1

For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a sample of size $N$, we have the following inequality for the distribution of the output hypothesis $Q$ and the prior $P$: where $\mathcal{D}(Q||P)$ is the KL divergence between the distributions $Q$ and $P$ and is defined as: $\mathcal{D}(Q||P) = \mathbb{E}_{\theta \sim Q} \log(\frac{Q(\theta)}{P(\theta)})$.
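For orientation, one standard PAC-Bayes bound of this type is the McAllester bound in Maurer's refined form. It is given here as an assumption about the general shape of the inequality; the paper's Lemma 1 may use different constants or a different variant. The symbols $L(\theta)$ (expected loss), $\hat{L}_S(\theta)$ (empirical loss on the sample $S$), and the restriction to losses bounded in $[0, 1]$ with $N \ge 8$ are part of this assumed form and are not taken from the lemma statement above.

$$
\mathbb{E}_{\theta \sim Q}\big[L(\theta)\big] \;\le\; \mathbb{E}_{\theta \sim Q}\big[\hat{L}_S(\theta)\big] + \sqrt{\frac{\mathcal{D}(Q\|P) + \log\frac{2\sqrt{N}}{\delta}}{2N}}
$$

In bounds of this shape the complexity term grows with $\mathcal{D}(Q\|P)$ and shrinks with the sample size $N$, which is the kind of trade-off that makes a bound-minimizing model size well defined.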

Figures (3)

  • Figure : (a) CIFAR-100
  • Figure : (a) CIFAR-100
  • Figure : (a) Impact of distributed data on the optimal model size.

Theorems & Definitions (27)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • Lemma 3
  • Theorem 3
  • Remark 1
  • Theorem 4
  • Remark 2
  • Remark 2
  • ...and 17 more