Prompt Public Large Language Models to Synthesize Data for Private On-device Applications

Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage

TL;DR

The paper tackles privacy-preserving on-device language modeling by prompting LLMs to synthesize server-side pre-training data that approximates the private user data distribution. It introduces three prompt types (filtering public data for mobile relevance, generating diverse mobile chat data, and transforming public text into conversations) and combines their outputs into a hybrid dataset (LLM-mix-166G) that outperforms a C4 baseline in subsequent DP-FL fine-tuning. In production-scale experiments on Gboard across en-US and en-IN, pre-training on the synthetic data yields substantial gains in next word prediction (NWP) accuracy ($22.8\%$ and $19.0\%$ relative), along with higher initial accuracy and better or comparable accuracy during DP-FL fine-tuning; production A/B tests show improvements in typing metrics. A follow-up study shows that post-hoc filtering of the synthetic data with a privately trained LM (yielding LLM-prox-32G) can further improve data quality and on-device NWP accuracy. Overall, the results demonstrate the viability of LLM-driven synthetic data for reducing distribution mismatch and privacy costs in real-world privacy-preserving FL for mobile keyboards.

Abstract

Pre-training on public data is an effective method to improve the performance of federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and to generate new data that resembles the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy compared to the baseline model pre-trained on a standard public dataset, when evaluated on real user data in Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.
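
The filter, transform, and generate steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the prompt wording, the function names (`filter_public`, `transform_public`, `build_mix`), and the `toy_llm` stand-in are hypothetical, and the paper's actual prompts, models, and mixing proportions are not reproduced here.

```python
from typing import Callable, Iterable, List

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

# Hypothetical prompt wording, NOT the prompts used in the paper.
FILTER_PROMPT = (
    "Is the following text likely to be typed on a mobile phone? "
    "Answer Yes or No.\n\n{text}"
)
TRANSFORM_PROMPT = (
    "Rewrite the following text as a short chat conversation.\n\n{text}"
)


def filter_public(texts: Iterable[str], llm: LLM) -> List[str]:
    """Keep only the public texts the LLM judges relevant to mobile typing."""
    kept = []
    for text in texts:
        answer = llm(FILTER_PROMPT.format(text=text))
        if answer.strip().lower().startswith("yes"):
            kept.append(text)
    return kept


def transform_public(texts: Iterable[str], llm: LLM) -> List[str]:
    """Ask the LLM to rewrite each public text in a conversational style."""
    return [llm(TRANSFORM_PROMPT.format(text=text)).strip() for text in texts]


def build_mix(public_texts: Iterable[str],
              generated_chats: Iterable[str],
              llm: LLM) -> List[str]:
    """Concatenate filtered, transformed, and freshly generated examples."""
    texts = list(public_texts)
    return (filter_public(texts, llm)
            + transform_public(texts, llm)
            + list(generated_chats))


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM, used only for this demo."""
    if prompt.startswith("Is the following"):
        return "Yes" if "lol" in prompt else "No"
    return "A: " + prompt.split("\n\n", 1)[1]


if __name__ == "__main__":
    mix = build_mix(
        ["see you soon lol", "Quarterly earnings rose."],
        ["B: hi"],
        toy_llm,
    )
    for example in mix:
        print(example)
```

In practice `toy_llm` would be replaced by a call to a real model, and the three sources would likely be weighted or deduplicated rather than simply concatenated; the sketch only shows how the three prompt types feed one pre-training corpus.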

Paper Structure (19 sections, 1 equation, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview of our experimental setup. It follows the two-step procedure for training on-device LMs for Gboard (Xu et al., 2023): 1) pre-training on server-side public data, followed by 2) fine-tuning on the private user data with DP FL. We use LLMs to synthesize data that replaces the public C4 data (Raffel et al., 2020) in step 1.
  • Figure 2: An FL training round with DP.
  • Figure 3: Overview of the procedure designed to increase the diversity of generated chat data. Given values for (AGE, GENDER, TIME, DAY, CHAT-APP), we sequentially use an LLM to generate receivers, topics, and conversations (see the section on LLM-generated chat data for details).
  • Figure 4: NWP evaluation accuracy when fine-tuning models with DP FL on real mobile devices in the (a) United States and (b) India populations. Compared to the baseline pre-trained on C4, the LM pre-trained on LLM synthetic data achieves higher initial accuracy and maintains superior or comparable accuracy throughout fine-tuning.
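
The sequential generation described in the Figure 3 caption (receivers, then topics, then conversations) can be sketched roughly as follows. This is a hedged illustration only: the prompt strings, the `[receivers]`/`[topics]`/`[conversation]` tags, the helper names, and `stub_llm` are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def _lines(text: str) -> List[str]:
    """Split an LLM completion into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def generate_chats(persona: Dict[str, str], llm: LLM) -> List[Dict[str, str]]:
    """Chain three LLM calls: receivers -> topics -> conversations.

    `persona` holds the conditioning values (AGE, GENDER, TIME, DAY,
    CHAT-APP). The prompt wording below is hypothetical.
    """
    context = ", ".join(f"{key}={value}" for key, value in persona.items())
    chats = []
    receivers = _lines(llm(
        f"[receivers] User: {context}. "
        "List people this user might message, one per line."))
    for receiver in receivers:
        topics = _lines(llm(
            f"[topics] User: {context}. Receiver: {receiver}. "
            "List likely chat topics, one per line."))
        for topic in topics:
            conversation = llm(
                f"[conversation] User: {context}. Receiver: {receiver}. "
                f"Topic: {topic}. Write the conversation.")
            chats.append({
                "receiver": receiver,
                "topic": topic,
                "conversation": conversation.strip(),
            })
    return chats


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM, used only for this demo."""
    if prompt.startswith("[receivers]"):
        return "mom\nfriend"
    if prompt.startswith("[topics]"):
        return "dinner\nweekend plans"
    return "A: hi\nB: hey"


if __name__ == "__main__":
    persona = {"AGE": "25", "GENDER": "female", "TIME": "evening",
               "DAY": "Friday", "CHAT-APP": "a messaging app"}
    for chat in generate_chats(persona, stub_llm):
        print(chat["receiver"], "|", chat["topic"])
```

Conditioning each stage on the output of the previous one is what spreads the generated conversations across many receiver and topic combinations, rather than sampling many conversations from a single fixed prompt.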

Theorems & Definitions (1)

  • Definition B.1: $(\epsilon,\delta)$-Differential Privacy
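
For reference, the standard textbook statement of $(\epsilon,\delta)$-differential privacy is given below. The paper's Definition B.1 is not reproduced in this summary, so its notation and its notion of neighboring datasets (for example, user-level adjacency in FL) may differ from this generic form.

```latex
A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon,\delta)$-differential
privacy if, for any two neighboring datasets $D$ and $D'$ and any set $S$ of
possible outputs,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
\]
```

Smaller values of $\epsilon$ and $\delta$ correspond to a stronger guarantee, since the output distributions on $D$ and $D'$ are then harder to tell apart.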