LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models

Juntaek Lim, Youngeun Kwon, Ranggi Hwang, Kiwan Maeng, G. Edward Suh, Minsoo Rhu

TL;DR

The paper examines the practical cost of training recommendation (RecSys) models with differential privacy. Its characterization shows that DP-SGD is bottlenecked by two stages: noise sampling, which is compute-bound, and the noisy gradient update, which turns sparse embedding-table updates into dense ones and becomes memory-bandwidth-bound. It introduces LazyDP, an algorithm-software co-design that delays noise updates and aggregates noise sampling, reducing memory traffic and noise-generation cost while preserving the same differential privacy guarantees as baseline DP-SGD. The authors report an average 119× training throughput improvement over a state-of-the-art DP-SGD system, along with substantial energy savings, bringing private RecSys training closer to practical deployment in large-scale ad and recommendation systems.

Abstract

Differential privacy (DP) is widely being employed in the industry as a practical standard for privacy protection. While private training of computer vision or natural language processing applications has been studied extensively, the computational challenges of training of recommender systems (RecSys) with DP have not been explored. In this work, we first present our detailed characterization of private RecSys training using DP-SGD, root-causing its several performance bottlenecks. Specifically, we identify DP-SGD's noise sampling and noisy gradient update stage to suffer from a severe compute and memory bandwidth limitation, respectively, causing significant performance overhead in training private RecSys. Based on these findings, we propose LazyDP, an algorithm-software co-design that addresses the compute and memory challenges of training RecSys with DP-SGD. Compared to a state-of-the-art DP-SGD training system, we demonstrate that LazyDP provides an average 119x training throughput improvement while also ensuring mathematically equivalent, differentially private RecSys models to be trained.
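The summary describes LazyDP as delaying noise updates and aggregating noise sampling. A minimal sketch of one way to read that idea follows. It is a toy illustration, not the authors' implementation: the class `LazyNoiseTable`, its methods, and the `noise_draws` counter are hypothetical names, and per-example clipping is omitted. Each row remembers the last step whose noise it has absorbed. When the row is next touched, the pending noise from all skipped steps is applied as a single Gaussian draw whose variance is the sum of the skipped per-step variances.

```python
import math
import random


class LazyNoiseTable:
    """Toy embedding table that defers Gaussian noise until a row is touched.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, num_rows, dim, sigma, lr, seed=0):
        self.rows = [[0.0] * dim for _ in range(num_rows)]
        self.last_noised = [0] * num_rows  # step up to which noise is applied
        self.sigma = sigma
        self.lr = lr
        self.rng = random.Random(seed)
        self.step = 0
        self.noise_draws = 0  # number of per-row noise vectors sampled

    def _catch_up(self, row):
        """Apply all pending per-step noise for `row` as one aggregated draw."""
        pending = self.step - self.last_noised[row]
        if pending > 0:
            std = math.sqrt(pending) * self.sigma  # variance = pending * sigma^2
            self.rows[row] = [
                w - self.lr * self.rng.gauss(0.0, std) for w in self.rows[row]
            ]
            self.noise_draws += 1
            self.last_noised[row] = self.step

    def read(self, row):
        """Forward-pass read: the row must carry all noise up to the last step."""
        self._catch_up(row)
        return self.rows[row]

    def update(self, grads):
        """One training step; only rows with a gradient are written."""
        self.step += 1
        for row, grad in grads.items():
            self._catch_up(row)  # includes the noise of the current step
            self.rows[row] = [
                w - self.lr * g for w, g in zip(self.rows[row], grad)
            ]

    def flush(self):
        """Apply any still-pending noise, e.g. before exporting the model."""
        for row in range(len(self.rows)):
            self._catch_up(row)


table = LazyNoiseTable(num_rows=8, dim=4, sigma=1.0, lr=0.1)
for _ in range(5):
    table.read(0)
    table.update({0: [0.5, 0.5, 0.5, 0.5]})  # only row 0 is ever accessed
table.flush()

dense_draws = 8 * 5  # an eager scheme would noise every row at every step
print(table.noise_draws, dense_draws)
```

In this toy run the lazy table samples 12 per-row noise vectors (5 for the hot row, 7 at flush time) where an eager dense update would sample 40. Because the update is linear in the weights, applying the aggregated noise later yields the same distribution over the rows once every pending draw has been flushed.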

Paper Structure (25 sections, 1 theorem, 2 equations, 14 figures, 1 algorithm)

Key Result

Theorem 1

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (i.i.d.) Gaussian random variables with mean 0 and variance $\sigma^2$. Then their sum, $Y = \sum_{i=1}^{n} X_i$, follows a Gaussian distribution with mean 0 and variance $n\sigma^2$.
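The statement can be checked numerically. The sketch below, which uses only the Python standard library, compares the empirical variance of a sum of $n$ draws from $\mathcal{N}(0, \sigma^2)$ with that of a single draw from $\mathcal{N}(0, n\sigma^2)$. The function names and the parameter values ($n = 8$, $\sigma = 0.5$, 20,000 trials) are arbitrary choices for illustration, not values from the paper.

```python
import random
import statistics


def summed_noise(rng, n, sigma):
    """Add up n independent N(0, sigma^2) draws."""
    return sum(rng.gauss(0.0, sigma) for _ in range(n))


def aggregated_noise(rng, n, sigma):
    """Draw once from N(0, n * sigma^2), i.e. standard deviation sqrt(n) * sigma."""
    return rng.gauss(0.0, (n ** 0.5) * sigma)


rng = random.Random(0)
n, sigma, trials = 8, 0.5, 20_000

var_summed = statistics.pvariance(
    [summed_noise(rng, n, sigma) for _ in range(trials)]
)
var_aggregated = statistics.pvariance(
    [aggregated_noise(rng, n, sigma) for _ in range(trials)]
)
expected = n * sigma ** 2  # theoretical variance of the sum

print(var_summed, var_aggregated, expected)
```

Both empirical variances should land close to the theoretical value $n\sigma^2$, while the second path needs one random draw per value instead of $n$.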

Figures (14)

  • Figure 1: A RecSys model architecture using embedding layers.
  • Figure 2: Non-private SGD vs. private SGD. (a) SGD and DP-SGD both go through the same set of operations during all of forward propagation and the activation gradient derivation of backpropagation. Key differences between the backpropagation and model update of (b) SGD and (c) DP-SGD include the L2 norm clipping, noise sampling, and updating the model weights with the noisy gradient.
  • Figure 3: Breakdown of SGD and DP-SGD's training time into key stages of forward and backward propagation. SGD's training time remains almost constant regardless of the table size, so this figure only shows a single SGD data point under the default 96 GB for brevity. All data points are normalized to this SGD (the leftmost bar).
  • Figure 4: Training an embedding table using (a) SGD and (b) DP-SGD. Example assumes a pooling value of $3$. (a) SGD only incurs $3$ table reads and $3$ table writes during forward propagation and model update, respectively, exhibiting sparse table accesses. (b) In contrast, DP-SGD requires an additional $8$ noise write operations on top of SGD's default $3$ table read and $3$ table write operations, collectively invoking a dense noisy gradient update during model update.
  • Figure 5: Latency breakdown (left axis) of the model update stage in Figure 3. The right axis shows the latency in model update (normalized to $96$ MB), the value of which grows as model size increases.
  • ...and 9 more figures
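The access-pattern contrast described in the Figure 4 caption can be sketched in a few lines. This is a toy illustration under stated assumptions: an 8-row table (inferred from the caption's 8 noise writes), 3 accessed rows to match the pooling value of 3, and hypothetical function names. It omits per-example gradient clipping, so it shows only which rows get written and is not a differentially private training step.

```python
import random


def sgd_update(table, grads, lr):
    """Sparse update: write only the rows that received a gradient."""
    for row, grad in grads.items():
        table[row] = [w - lr * g for w, g in zip(table[row], grad)]
    return len(grads)  # number of rows written


def dp_sgd_update(table, grads, lr, sigma, rng):
    """Dense update: every row receives Gaussian noise, accessed or not."""
    dim = len(table[0])
    for row in range(len(table)):
        grad = grads.get(row, [0.0] * dim)  # untouched rows have zero gradient
        table[row] = [
            w - lr * (g + rng.gauss(0.0, sigma))
            for w, g in zip(table[row], grad)
        ]
    return len(table)  # number of rows written


NUM_ROWS, DIM = 8, 4
# Gradients exist only for the 3 rows gathered in the forward pass.
grads = {1: [0.1] * DIM, 4: [0.2] * DIM, 6: [0.3] * DIM}

sgd_table = [[0.0] * DIM for _ in range(NUM_ROWS)]
dp_table = [[0.0] * DIM for _ in range(NUM_ROWS)]

sgd_rows = sgd_update(sgd_table, grads, lr=0.1)
dp_rows = dp_sgd_update(dp_table, grads, lr=0.1, sigma=1.0, rng=random.Random(0))

print(sgd_rows, dp_rows)  # prints "3 8"
```

The number of rows written by the SGD path depends only on the pooling value, while the DP-SGD path scales with the table size, which is consistent with the captions' observation that model-update latency grows with model size.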

Theorems & Definitions (1)

  • Theorem 1: Sum of i.i.d. normal random variables