
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar

TL;DR

Across multiple datasets, small models trained on PrE-Text synthetic data outperform small models trained on-device under practical privacy regimes, which suggests that training on DP synthetic data can be a better option than training a model on-device on private distributed data.

Abstract

On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using $9\times$ fewer rounds, $6\times$ less client computation per round, and $100\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.

Paper Structure (17 sections, 1 equation, 3 figures, 6 tables, 2 algorithms)


Figures (3)

  • Figure 1: A high-level description of PrE-Text. PrE-Text consists of two main phases: (1) (iterative) DP synthetic seed collection, (2) (single-shot) synthetic seed expansion. A detailed description of the steps in the diagram is given in the algorithm description section.
  • Figure 2: Next-token prediction accuracy for PrE-Text as we vary the number of synthetic examples generated by the Expand part of the algorithm. We find that increasing the number of synthetic examples across several orders of magnitude improves the accuracy of the downstream model (DistilGPT2) roughly log-linearly, though the growth peaks on Code after 1M samples.
  • Figure 3: The synthetic data generation prompt for Expand. The blue text after "Original Text Sample 4" is generated. We parse the generated text for the text between "Original Text Sample 4" and "Original Text Sample 5" and use that as a synthetic sample.
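
The parsing step described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `extract_synthetic_sample`, the optional colon after the marker, and the choice to return `None` when a marker is missing are all assumptions made for illustration.

```python
from typing import Optional

# Marker strings taken from the Figure 3 caption.
START_MARKER = "Original Text Sample 4"
END_MARKER = "Original Text Sample 5"


def extract_synthetic_sample(generated: str) -> Optional[str]:
    """Return the text between the two markers, or None if it cannot be found.

    Hypothetical helper: the colon handling and the None fallback are
    assumptions, not behavior confirmed by the paper.
    """
    start = generated.find(START_MARKER)
    if start == -1:
        return None
    start += len(START_MARKER)

    # Search for the end marker only after the start marker.
    end = generated.find(END_MARKER, start)
    if end == -1:
        return None

    # Drop an optional leading colon and surrounding whitespace.
    sample = generated[start:end].lstrip(": \t").strip()
    return sample or None


if __name__ == "__main__":
    text = (
        "Original Text Sample 3: foo\n"
        "Original Text Sample 4: hello world\n"
        "Original Text Sample 5: bar"
    )
    print(extract_synthetic_sample(text))
```

Returning `None` rather than a truncated string lets the caller decide whether to discard or retry a generation that never reached the end marker.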

Theorems & Definitions (3)

  • Definition 2.1: Neighboring datasets
  • Definition 2.2: Differential Privacy
  • Definition 2.3: $\ell_2$ sensitivity (Dwork et al., 2014)
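
For orientation, the textbook forms of the last two definitions are sketched below. These are the standard statements from the DP literature, not text copied from the paper. The symbols $\mathcal{M}$, $D$, $D'$, $S$, $f$, and $\delta$ are assumed notation, and the neighboring relation between $D$ and $D'$ is the one fixed by Definition 2.1, which is not reproduced here.

```latex
% (epsilon, delta)-differential privacy: for all neighboring datasets D, D'
% and all measurable output sets S, a randomized mechanism M satisfies
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
\]

% l2 sensitivity of a function f, with the maximum taken over neighboring D, D':
\[
  \Delta_2 f \;=\; \max_{D,\, D' \text{ neighboring}} \; \lVert f(D) - f(D') \rVert_2 .
\]
```

Smaller $\epsilon$ means a stronger guarantee, so of the two regimes reported in the abstract, $\epsilon=1.29$ is the stricter one.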