Can Public Large Language Models Help Private Cross-device Federated Learning?

Boxin Wang; Yibo Jacky Zhang; Yuan Cao; Bo Li; H. Brendan McMahan; Sewoong Oh; Zheng Xu; Manzil Zaheer

TL;DR

This work addresses the challenge of training on-device language models under user-level differential privacy (DP), where utility degrades for small models. By systematically exploring public pre-training, distillation from publicly trained LLMs, and a theory-grounded distribution-matching data sampling strategy, the authors show how public data can substantially boost the performance of DP federated learning (DP-FL) for private on-device LMs. Key contributions include (i) showing that public tokenizers and public pre-training improve the privacy-utility trade-off; (ii) introducing a distillation framework from LLMs to on-device LMs that improves sample efficiency; and (iii) developing a distribution-matching algorithm with theoretical guarantees that selects public data aligned with the private distribution, achieving similar performance with far less public data and reduced training time. The approach offers a practical pathway to leverage public LLMs to strengthen privacy-preserving on-device NLP, with gains in efficiency and accuracy under tight privacy constraints.

Abstract

We study (differentially) private federated learning (FL) of language models. The language models in cross-device FL are relatively small, and they can be trained with meaningful formal user-level differential privacy (DP) guarantees when massive parallelism in training is enabled by the participation of a moderate number of users. Recently, public data has been used to improve privacy-utility trade-offs for both large and small language models. In this work, we provide a systematic study of using large-scale public data and LLMs to help differentially private training of on-device FL models, and further improve the privacy-utility trade-off through distillation. Moreover, we propose a novel distribution matching algorithm with theoretical grounding to sample public data close to the private data distribution, which significantly improves the sample efficiency of (pre-)training on public data. The proposed method is efficient and effective for training private models by taking advantage of public data, especially for customized on-device architectures that do not have ready-to-use pre-trained models.
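
To make the role of user-level DP and parallelism more concrete, the following is a minimal sketch of one noisy aggregation round, assuming a generic Gaussian mechanism: each user's model update is clipped to a fixed L2 norm, the clipped updates are summed, Gaussian noise is added, and the result is averaged. The function names `clip_update` and `dp_average` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions; this is not the training algorithm used in the paper, which this summary does not specify.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a user's update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(user_updates, clip_norm, noise_multiplier, rng):
    """Average clipped per-user updates with Gaussian noise on the sum.

    Assumption: noise standard deviation is noise_multiplier * clip_norm,
    the usual calibration to the per-user sensitivity of the sum.
    """
    clipped = [clip_update(u, clip_norm) for u in user_updates]
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm
    noisy_sum = [
        sum(u[i] for u in clipped) + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]
    # Dividing by the number of users shrinks the noise per coordinate,
    # which is why more participating users help utility.
    return [s / len(clipped) for s in noisy_sum]


if __name__ == "__main__":
    rng = random.Random(0)
    updates = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]
    print(dp_average(updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Because the noise scale is fixed by `clip_norm` while the sum grows with the number of users, the noise in the average falls roughly as one over the number of participants in a round.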

Paper Structure (29 sections, 2 theorems, 6 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Differentially Private Federated Learning for On-device LMs
Inspiration from LLMs
Using Public Tokenizer from LLMs
Publicly pre-training for On-device LMs
Distillation from Public LLM
Distillation Design
Experimental Results
Distribution Matching
Algorithm
Theoretical Analysis
Experimental Results
Conclusion
Additional Related Work
Private Federated Learning in On-device NLP
...and 14 more sections

Key Result

Theorem 5.1

Let $\epsilon(\hat{f})=\mathbb{E}[\|\hat{f}-\ell_{\text{priv}}\|^2]$ characterise how good $\hat{f}$ is as an estimator of the true private data log-density $\ell_{\text{priv}}$ for any random function $\hat{f}\in {\mathcal{H}}$. Consider the following three quantities: Then,

Figures (4)

Figure 1: Next word (token) prediction accuracy for on-device LSTM with different tokenizers in the private FL.
Figure 2: Ablation studies on how distillation steps and top-$k$ logits in distillation impact next token prediction accuracy (Acc.) of on-device LSTM models on the dev set of the private StackOverflow dataset.
Figure 3: Visualization of PPL distribution of the private and public datasets evaluated by the private on-device LM and the public LLM. The private dataset exhibits a concentration of low PPL values, whereas the public corpus is dispersed across a broader range of PPL values, with a higher average PPL.
Figure 4: Ablation studies on how distillation steps and top-$k$ logits in distillation impact next token prediction accuracy (Acc.) of on-device LSTM models on the private StackOverflow dataset.
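
The captions for Figures 2 and 4 refer to distilling with only the top-$k$ teacher logits. One plausible reading can be sketched as follows: keep the teacher's $k$ largest logits for a position, renormalize them with a softmax, and take the cross-entropy against the student's log-probabilities over the full vocabulary. The function names and this exact loss form are assumptions for illustration; the paper's distillation design may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def topk_distill_loss(teacher_logits, student_logits, k):
    """Cross-entropy between the renormalized top-k teacher distribution
    and the student's distribution (hypothetical loss form)."""
    # Indices of the k largest teacher logits.
    top = sorted(
        range(len(teacher_logits)),
        key=lambda i: teacher_logits[i],
        reverse=True,
    )[:k]
    # Teacher probabilities restricted to the top-k tokens.
    teacher_probs = [math.exp(lp) for lp in log_softmax([teacher_logits[i] for i in top])]
    student_log_probs = log_softmax(student_logits)
    return -sum(p * student_log_probs[i] for p, i in zip(teacher_probs, top))


if __name__ == "__main__":
    teacher = [2.0, 0.0, 1.0, -1.0]
    student = [1.5, 0.2, 0.9, -0.5]
    print(topk_distill_loss(teacher, student, k=2))
```

Keeping only $k$ logits per position reduces how much teacher output has to be stored or transmitted, at the cost of discarding the tail of the teacher distribution.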

Theorems & Definitions (4)

Definition 2.1: $(\varepsilon, \delta)$-Differential Privacy
Theorem 5.1
Theorem D.1: Theorem 5.1 Restated
proof
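
For reference, the standard textbook form of $(\varepsilon, \delta)$-differential privacy, which Definition 2.1 names, is given below. The symbols $\mathcal{M}$, $D$, $D'$ and $S$ are notation assumed here rather than taken from the paper, and adjacency is assumed to be at the user level because the abstract refers to user-level DP. A randomized algorithm $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-DP if, for all adjacent datasets $D$ and $D'$ (differing in the data of one user) and all sets of outputs $S$,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
```

Smaller values of $\varepsilon$ and $\delta$ correspond to a stronger guarantee.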
