Prompt Public Large Language Models to Synthesize Data for Private On-device Applications
Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage
TL;DR
The paper tackles privacy-preserving on-device language modeling by prompting LLMs to synthesize server-side pre-training data that approximates the private user data distribution. It introduces three prompt types (filtering public data for mobile relevance, generating diverse mobile chat data, and transforming public text into conversations) and combines their outputs into a hybrid dataset (LLM-mix-166G). In production-scale experiments on Gboard across en-US and en-IN, a model pre-trained on the synthetic data achieves relative next word prediction (NWP) accuracy gains of 19.0% and 22.8% over a baseline pre-trained on C4 when evaluated on real user data. During subsequent DP-FL fine-tuning, it converges faster with evaluation accuracy better than or comparable to the baseline, and production A/B tests show improvements in typing metrics. A follow-up study shows that post-hoc filtering of the synthetic data with a privately trained LM (yielding LLM-prox-32G) can further improve data quality and on-device NWP accuracy. Overall, the results support LLM-driven synthetic data as a way to reduce distribution mismatch and privacy costs in real-world privacy-preserving FL for mobile keyboards.
Abstract
Pre-training on public data is an effective method to improve the performance of federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and to generate new data that resembles the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy compared to the baseline model pre-trained on a standard public dataset, when evaluated on real user data in Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.
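
As a reading aid for the reported numbers, a relative improvement is the accuracy gain divided by the baseline accuracy, not the absolute difference in accuracy points. The sketch below illustrates that arithmetic only; the function name and the two accuracy values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Return the gain of new_acc over baseline_acc as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies, chosen only to illustrate the arithmetic;
# they are NOT values reported in the paper.
baseline = 0.100
synthetic = 0.119

gain = relative_improvement(baseline, synthetic)
print(f"relative improvement: {gain:.1%}")
```

With these illustrative inputs, an absolute gain of 1.9 accuracy points corresponds to a relative improvement of about 19%.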
