Resource-Adaptive Federated Text Generation with Differential Privacy

Jiayi Wang, John Gounley, Heidi Hanson

TL;DR

This work proposes a flexible participation framework that adapts to client capacities and improves distribution alignment and downstream robustness under DP and heterogeneity.

Abstract

In cross-silo federated learning (FL), sensitive text datasets remain confined to local organizations due to privacy regulations, making repeated training for each downstream task both communication-intensive and privacy-demanding. A promising alternative is to generate differentially private (DP) synthetic datasets that approximate the global distribution and can be reused across tasks. However, pretrained large language models (LLMs) often fail under domain shift, and federated finetuning is hindered by computational heterogeneity: only resource-rich clients can update the model, while weaker clients are excluded, amplifying data skew and the adverse effects of DP noise. We propose a flexible participation framework that adapts to client capacities. Strong clients perform DP federated finetuning, while weak clients contribute through a lightweight DP voting mechanism that refines synthetic text. To ensure the synthetic data mirrors the global dataset, we apply control codes (e.g., labels, topics, metadata) that represent each client's data proportions and constrain voting to semantically coherent subsets. This two-phase approach requires only a single round of communication for weak clients and integrates contributions from all participants. Experiments show that our framework improves distribution alignment and downstream robustness under DP and heterogeneity.
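
The weak-client voting step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's algorithm: the abstract does not specify the vote rule or noise mechanism, so the sketch uses nearest-neighbor votes over precomputed embeddings, Gaussian noise on each client's vote histogram, and top-k selection per control code. The names `nearest`, `client_votes`, `refine`, `sigma`, and `keep` are hypothetical.

```python
import random


def nearest(vec, candidates):
    """Index of the candidate embedding closest to vec (squared Euclidean)."""
    return min(
        range(len(candidates)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, candidates[i])),
    )


def client_votes(private, synthetic, sigma, rng):
    """One client's noisy vote histogram, computed separately per control code.

    private / synthetic: dict mapping control code -> list of embedding vectors.
    Each private sample casts exactly one vote, so adding or removing a sample
    changes one histogram bin by 1 (assumed sensitivity); Gaussian noise with
    standard deviation sigma is then added to every bin.
    """
    noisy = {}
    for code, cands in synthetic.items():
        hist = [0.0] * len(cands)
        # Votes are restricted to synthetic samples sharing the control code.
        for vec in private.get(code, []):
            hist[nearest(vec, cands)] += 1.0
        noisy[code] = [h + rng.gauss(0.0, sigma) for h in hist]
    return noisy


def refine(synthetic, all_votes, keep):
    """Sum the clients' noisy histograms; keep the top-`keep` indices per code."""
    kept = {}
    for code, cands in synthetic.items():
        total = [sum(v[code][i] for v in all_votes) for i in range(len(cands))]
        order = sorted(range(len(cands)), key=lambda i: total[i], reverse=True)
        kept[code] = sorted(order[:keep])
    return kept
```

In this sketch each weak client receives the synthetic candidates once and returns only its noisy histogram, which matches the single communication round described above; calibrating `sigma` to a target privacy budget is left out.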

Paper Structure (20 sections, 11 equations, 4 figures, 14 tables, 4 algorithms)

This paper contains 20 sections, 11 equations, 4 figures, 14 tables, 4 algorithms.

Figures (4)

  • Figure 1: To perform DP synthetic text generation in cross-silo FL, we aim to address two challenges: heterogeneity in computational resources and data distributions. Our approach enables flexible participation through DP federated finetuning of the generator model on well-resourced clients and DP voting on generated synthetic text on the remaining clients.
  • Figure 2: MAUVE score for Yelp IID results and NER macro F1 score for PubMed IID results.
  • Figure A.3: Vote distributions of Yelp synthetic data generated for the restaurant category and a 3-star rating. The left two panels show results using only the pretrained model; the right two panels show results using the model finetuned with $10\%$ $\mathcal{C}_s$ clients. Before finetuning, most samples receive similar numbers of votes, limiting the effect of refinement. After DP finetuning, however, a long-tail distribution emerges: certain samples receive significantly more votes than others (Figure \ref{fig:c10-dist}), reflecting the model's improved ability to generate text that aligns with the original data distribution. This concentration of votes on high-quality samples is precisely what our refinement step exploits to enhance synthetic data quality.
  • Figure A.4: Vote distributions of PubMed synthetic data generated for A, D. The left two panels show results using only the pretrained model; the right two panels show results using the model finetuned with $10\%$ $\mathcal{C}_s$ clients. The observations mirror those from Figure \ref{fig:vote-dist}.

Theorems & Definitions (1)

  • Definition 1: Differential Privacy \cite{dwork2014algorithmic}
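
For reference, the standard $(\varepsilon, \delta)$ formulation of differential privacy is given below. The paper's own wording of Definition 1 is not reproduced on this page, so the symbols $\mathcal{M}$, $D$, $D'$, and $S$ are assumptions taken from the usual textbook statement. A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single record and every measurable set of outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\varepsilon$ and $\delta$ mean that the output distribution changes less when one record is added, removed, or replaced, which is what bounds the information any single record can leak.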