LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models

Juntaek Lim; Youngeun Kwon; Ranggi Hwang; Kiwan Maeng; G. Edward Suh; Minsoo Rhu

TL;DR

The paper tackles the practical challenges of training differentially private RecSys models with DP-SGD. Its characterization shows that noise sampling is compute-bound, while the noisy gradient update is memory-bandwidth-bound because it touches embedding tables densely. It introduces LazyDP, an algorithm-software co-design that delays noise updates and aggregates noise sampling, reducing memory traffic and noise-generation cost while preserving the same differential privacy guarantees as baseline DP-SGD. Empirical results show LazyDP achieving an average 119× training throughput improvement over a state-of-the-art DP-SGD training system, along with substantial energy savings, bringing private RecSys training closer to practical deployment on large embedding tables. This work advances PPML for RecSys by delivering a scalable, privacy-preserving training pipeline applicable to large-scale ad and recommendation systems.
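To make "delays noise updates and aggregates noise sampling" concrete, here is a minimal toy sketch in plain Python. It is an illustration under our own assumptions, not the authors' implementation: the class `LazyNoisyTable`, its fields, and the choice to scale noise by the learning rate are all hypothetical. Each row remembers the step at which its noise was last settled. When the row is next accessed, the noise it still owes is applied with a single Gaussian draw per element whose variance is the number of owed steps times $\sigma^2$.

```python
import math
import random


class LazyNoisyTable:
    """Toy embedding table that defers Gaussian noise until a row is next accessed."""

    def __init__(self, num_rows, dim, sigma, lr, seed=0):
        self.rows = [[0.0] * dim for _ in range(num_rows)]
        self.last_settled = [0] * num_rows  # step up to which each row's noise is applied
        self.step_count = 0                 # number of completed training steps
        self.sigma = sigma
        self.lr = lr
        self.rng = random.Random(seed)
        self.noise_draws = 0                # Gaussian samples drawn so far

    def _flush(self, idx):
        """Apply all noise owed to row idx using one aggregated draw per element."""
        owed = self.step_count - self.last_settled[idx]
        if owed == 0:
            return
        # The sum of `owed` i.i.d. N(0, sigma^2) samples is N(0, owed * sigma^2).
        std = self.sigma * math.sqrt(owed)
        for j in range(len(self.rows[idx])):
            self.rows[idx][j] -= self.lr * self.rng.gauss(0.0, std)
            self.noise_draws += 1
        self.last_settled[idx] = self.step_count

    def step(self, accessed, clipped_grads):
        """Settle owed noise for the accessed rows, then apply their gradients."""
        for idx in accessed:
            self._flush(idx)  # the row must be up to date before it is read
        for idx, grad in zip(accessed, clipped_grads):
            for j, g in enumerate(grad):
                self.rows[idx][j] -= self.lr * g
        # This step's noise is owed by every row, but only a counter changes now.
        self.step_count += 1

    def flush_all(self):
        """Settle every row, e.g. before the trained table is released."""
        for idx in range(len(self.rows)):
            self._flush(idx)


table = LazyNoisyTable(num_rows=1000, dim=4, sigma=1.0, lr=0.1)
for s in range(10):
    rows = [3 * s, 3 * s + 1, 3 * s + 2]
    grads = [[0.01] * 4 for _ in rows]  # stand-in for clipped gradients
    table.step(rows, grads)
table.flush_all()

dense_draws = 1000 * 4 * 10  # a dense update would draw noise for every element each step
print(table.noise_draws, dense_draws)
```

In this toy run the lazy table draws far fewer Gaussian samples than a dense per-step update would, while every row still receives noise with the same total variance once `flush_all` has run.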

Abstract

Differential privacy (DP) is being widely employed in industry as a practical standard for privacy protection. While private training of computer vision or natural language processing applications has been studied extensively, the computational challenges of training recommender systems (RecSys) with DP have not been explored. In this work, we first present our detailed characterization of private RecSys training using DP-SGD, root-causing its several performance bottlenecks. Specifically, we find that DP-SGD's noise sampling and noisy gradient update stages suffer from severe compute and memory bandwidth limitations, respectively, causing significant performance overhead in training private RecSys. Based on these findings, we propose LazyDP, an algorithm-software co-design that addresses the compute and memory challenges of training RecSys with DP-SGD. Compared to a state-of-the-art DP-SGD training system, we demonstrate that LazyDP provides an average 119x training throughput improvement while ensuring that the trained RecSys models are mathematically equivalent and differentially private.

Paper Structure (25 sections, 1 theorem, 2 equations, 14 figures, 1 algorithm)

Introduction
Background
Recommendation Models
System Architecture for Training RecSys
Differential Privacy
Non-private SGD vs. Private DP-SGD Training
Related Work
Threat Model
Workload Characterization on Private RecSys Training with DP-SGD
Breakdown of End-to-End Training Time
Analysis on the Model Update Stage in Private Embedding Table Training
Root-causing the Key Challenges behind DP-SGD's Noise Sampling and Noisy Gradient Update
LazyDP: An Algorithm-Software Co-Design for Differentially Private RecSys Training
Design Principles and Key Observations
Implementation
...and 10 more sections

Key Result

Theorem 1

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (i.i.d.) Gaussian random variables with a mean value of 0 and a variance of $\sigma^2$. Then, the summation of these random variables, denoted by $Y = \sum_{i=1}^{n} X_i$, follows a Gaussian distribution with a mean of 0 and a variance of $n\sigma^2$.
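The theorem can be checked empirically with a short simulation. The sketch below uses only the Python standard library; the helper names and the values of `n`, `sigma`, and `trials` are arbitrary choices for illustration. It compares the variance of a sum of $n$ draws from $N(0, \sigma^2)$ with the variance of a single draw from $N(0, n\sigma^2)$.

```python
import math
import random
import statistics


def sum_of_gaussians(n, sigma, rng):
    """Draw n i.i.d. N(0, sigma^2) samples and return their sum."""
    return sum(rng.gauss(0.0, sigma) for _ in range(n))


def aggregated_gaussian(n, sigma, rng):
    """Draw one sample from N(0, n * sigma^2), the distribution the theorem gives for the sum."""
    return rng.gauss(0.0, sigma * math.sqrt(n))


def empirical_variance(sampler, trials, rng):
    """Population variance of `trials` values produced by sampler(rng)."""
    return statistics.pvariance([sampler(rng) for _ in range(trials)])


rng = random.Random(0)
n, sigma, trials = 16, 0.5, 20000

var_sum = empirical_variance(lambda r: sum_of_gaussians(n, sigma, r), trials, rng)
var_one = empirical_variance(lambda r: aggregated_gaussian(n, sigma, r), trials, rng)
expected = n * sigma ** 2  # n * sigma^2 = 16 * 0.25

print(var_sum, var_one, expected)
```

Both empirical variances should land close to `expected`, while the aggregated version needs one Gaussian draw instead of `n`.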

Figures (14)

Figure 1: A RecSys model architecture using embedding layers.
Figure 2: Non-private SGD vs. private SGD. (a) SGD and DP-SGD both go through the same set of operations during all of forward propagation and the activation gradient derivation of backpropagation. Key differences between the backpropagation and model update of (b) SGD and (c) DP-SGD include the L2 norm clipping, noise sampling, and updating the model weights with the noisy gradient.
Figure 3: Breakdown of SGD and DP-SGD's training time into key stages of forward and backward propagation. SGD's training time remains almost constant regardless of the table size, so this figure only shows a single SGD data point under the default 96 GB for brevity. All data points are normalized to this SGD (the leftmost bar).
Figure 4: Training an embedding table using (a) SGD and (b) DP-SGD. Example assumes a pooling value of $3$. (a) SGD only incurs $3$ table reads and $3$ table writes during forward propagation and model update, respectively, exhibiting sparse table accesses. (b) In contrast, DP-SGD requires an additional $8$ noise write operations on top of SGD's default $3$ table read and $3$ table write operations, collectively invoking a dense noisy gradient update during model update.
Figure 5: Latency breakdown (left axis) of the model update stage in Figure 3. The right axis shows the latency in model update (normalized to $96$ MB), the value of which grows as model size increases.
...and 9 more figures
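The access counts in the Figure 4 caption can be reproduced with a small counting sketch. It assumes the example table has 8 rows, which is our reading of the 8 noise writes rather than something the caption states, and the function names are hypothetical.

```python
def sgd_row_accesses(accessed_rows):
    """Non-private SGD: read each gathered row in forward, write it back in model update."""
    rows = set(accessed_rows)
    return {"table_reads": len(rows), "table_writes": len(rows), "noise_writes": 0}


def dp_sgd_row_accesses(accessed_rows, num_rows):
    """DP-SGD: same sparse reads and writes, plus a noise write to every row of the table."""
    rows = set(accessed_rows)
    return {"table_reads": len(rows), "table_writes": len(rows), "noise_writes": num_rows}


pooled_rows = [1, 4, 6]  # pooling value of 3, row indices chosen arbitrarily
sgd = sgd_row_accesses(pooled_rows)
dp = dp_sgd_row_accesses(pooled_rows, num_rows=8)  # assumed 8-row table
print(sgd)
print(dp)

# The noise writes scale with the table size, not with the pooling value.
for num_rows in (8, 1_000, 1_000_000):
    print(num_rows, dp_sgd_row_accesses(pooled_rows, num_rows)["noise_writes"])
```

The SGD counts depend only on the pooling value, whereas the DP-SGD noise writes grow linearly with the number of rows, which is the dense update pattern the caption describes.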

Theorems & Definitions (1)

Theorem 1: Sum of i.i.d. normal random variables
