LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models
Juntaek Lim, Youngeun Kwon, Ranggi Hwang, Kiwan Maeng, G. Edward Suh, Minsoo Rhu
TL;DR
The paper tackles the practical challenges of training differentially private RecSys models, showing that DP-SGD incurs heavy compute cost from noise sampling and memory bandwidth pressure from dense noisy gradient updates on embedding tables. It introduces LazyDP, an algorithm-software co-design that delays noise updates and aggregates noise sampling, sharply reducing memory traffic and noise-generation cost while preserving the same differential privacy guarantees as baseline DP-SGD. Empirical results show LazyDP achieving an average 119× training throughput improvement over a state-of-the-art DP-SGD training system, along with substantial energy savings, bringing private RecSys training closer to practical deployment on large embedding tables. This work advances privacy-preserving machine learning (PPML) for RecSys by delivering a scalable private training pipeline applicable to large-scale ad and recommendation systems.
Abstract
Differential privacy (DP) is being widely adopted in industry as a practical standard for privacy protection. While private training of computer vision and natural language processing applications has been studied extensively, the computational challenges of training recommender systems (RecSys) with DP have not been explored. In this work, we first present a detailed characterization of private RecSys training using DP-SGD and root-cause its performance bottlenecks. Specifically, we find that DP-SGD's noise sampling stage and noisy gradient update stage suffer from severe compute and memory bandwidth limitations, respectively, causing significant performance overhead in private RecSys training. Based on these findings, we propose LazyDP, an algorithm-software co-design that addresses the compute and memory challenges of training RecSys with DP-SGD. Compared to a state-of-the-art DP-SGD training system, we demonstrate that LazyDP provides an average 119x training throughput improvement while still training mathematically equivalent, differentially private RecSys models.
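To make the idea of delayed and aggregated noise concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. It assumes the noise is i.i.d. Gaussian per step, so that the k noise draws a row has missed can be replaced by a single draw, since the sum of k independent N(0, sigma^2) samples is distributed as N(0, k*sigma^2). The class name LazyNoiseTable, its methods, and the parameters lr and sigma are hypothetical, and gradient clipping is assumed to have been done by the caller.

```python
import numpy as np

class LazyNoiseTable:
    """Hypothetical sketch: an embedding table whose DP noise is applied lazily."""

    def __init__(self, table, lr, sigma, rng):
        self.table = table            # float array of shape (num_rows, dim), updated in place
        self.lr = lr                  # learning rate
        self.sigma = sigma            # per-step noise standard deviation (assumed given)
        self.rng = rng                # np.random.Generator
        self.step = 0                 # number of training steps started so far
        # Per row: how many steps' worth of noise has already been applied.
        self.last_noised = np.zeros(table.shape[0], dtype=np.int64)

    def _flush(self, rows):
        # Number of noise updates each row still owes.
        pending = self.step - self.last_noised[rows]
        # One aggregated draw per row: N(0, pending * sigma^2) replaces `pending` draws.
        std = self.sigma * np.sqrt(pending)[:, None]
        noise = self.rng.normal(0.0, 1.0, size=(len(rows), self.table.shape[1])) * std
        self.table[rows] -= self.lr * noise
        self.last_noised[rows] = self.step

    def lookup(self, rows):
        # Bring rows up to date before they are read, so the values seen by the
        # forward pass follow the same distribution as under eager noising.
        rows = np.asarray(rows)
        self._flush(rows)
        return self.table[rows]

    def apply_step(self, rows, row_grads):
        # `rows` must be unique; `row_grads` are the already clipped and summed
        # gradients for those rows. Untouched rows are not written at all.
        rows = np.asarray(rows)
        self.step += 1
        self._flush(rows)             # adds this step's noise to the touched rows
        self.table[rows] -= self.lr * row_grads

    def finalize(self):
        # Apply all outstanding noise before the model is released.
        self._flush(np.arange(self.table.shape[0]))
```

In this sketch each step writes only the rows that the batch actually touches, while an eager DP-SGD step would sample noise for, and write to, every row of the table. Calling finalize() at the end of training is what ensures that the released table has received noise for every step on every row.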
