Scaling Laws for Differentially Private Language Models
Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
TL;DR
This work extends scaling laws to differentially private (DP) language model training by explicitly modeling compute, data, and privacy budgets. It introduces a methodology that decouples noise calibration, uses semi-parametric (and parametric) fits, and leverages the fitted utility functions to predict cross-entropy under DP constraints. The results reveal that DP fundamentally changes the optimal allocation of compute, typically favoring smaller models trained on substantially more data, and show meaningful compute savings and privacy-utility tradeoffs compared to non-private baselines. The findings offer a practical, budget-aware framework to guide DP training of large language models and highlight important considerations for privacy risk and computational cost in real-world deployments.
Abstract
Scaling laws have emerged as important components of large language model (LLM) training: they can predict performance gains from scale and provide guidance on important hyper-parameter choices that would otherwise be expensive to determine. LLMs also rely on large, high-quality training datasets, such as those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections such as differential privacy (DP). However, the dynamics of DP training differ significantly from those of non-private training, and consequently the scaling laws that govern DP training are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.
