
Privacy and Accuracy-Aware AI/ML Model Deduplication

Hong Guan, Lei Yu, Lixi Zhou, Li Xiong, Kanchan Chowdhury, Lulu Xie, Xusheng Xiao, Jia Zou

TL;DR

This work formalizes the problem of deduplicating DP-trained models for the first time and proposes a novel privacy- and accuracy-aware deduplication mechanism to address the problem.

Abstract

With the growing adoption of privacy-preserving machine learning algorithms such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly common. This shift creates a need for models that offer varying privacy guarantees and utility levels to satisfy diverse user requirements. However, managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is widely used by model serving and database systems to support high-performance, low-cost inference queries and model diagnosis queries. However, no existing model deduplication approach considers privacy. This leads to unbounded accumulation of privacy costs for certain deduplicated models and to inefficiencies when the approaches are applied to DP-trained models. We formalize the problems of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address them. We develop a greedy strategy that selects base models and assigns them to target models to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines that do not provide privacy guarantees, our approach improved the compression ratio by up to $35\times$ for individual models (including large language models and vision transformers). We also observed up to $43\times$ inference speedup due to the reduction in I/O operations.
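
The abstract mentions the Sparse Vector Technique (SVT) for private accuracy validation. This summary does not give the paper's exact SVT variant. As an illustration only, the sketch below uses the textbook AboveThreshold algorithm from Dwork and Roth. Only the first query whose noisy value crosses a noisy threshold is reported, so the whole run is epsilon-DP no matter how many below-threshold validations precede it. The function name `above_threshold`, its parameters, and the toy accuracy drops are assumptions for illustration.

```python
import numpy as np


def above_threshold(query_values, threshold, epsilon, sensitivity, rng):
    """AboveThreshold (Dwork & Roth): index of the first query judged above threshold, else None.

    Hypothetical helper for illustration; not necessarily the paper's SVT variant.
    `query_values` can be a lazy iterable so each validation runs only when needed.
    """
    # Noisy threshold, drawn once per run.
    noisy_threshold = threshold + rng.laplace(scale=2.0 * sensitivity / epsilon)
    for i, value in enumerate(query_values):
        # Fresh noise per query; releasing below-threshold answers adds no extra privacy cost.
        noisy_value = value + rng.laplace(scale=4.0 * sensitivity / epsilon)
        if noisy_value >= noisy_threshold:
            return i  # halt after the first "above" answer; the whole run is epsilon-DP
    return None


rng = np.random.default_rng(0)
drops = [0.001, 0.002, 0.010, 0.003]  # toy accuracy drops from successive validations
# A huge epsilon makes the noise negligible, so this reports the first drop above 0.005.
print(above_threshold(drops, threshold=0.005, epsilon=1e9, sensitivity=1e-3, rng=rng))  # 2
```

In one plausible mapping to deduplication, each query value would be the accuracy drop measured on private validation data after one tentative block replacement. `sensitivity` must bound how much a single validation record can change that value, and this sketch leaves deriving it to the caller. With realistic budgets the answer is randomized.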

Paper Structure

This paper contains 39 sections, 4 theorems, 6 equations, 9 figures, 8 tables, 5 algorithms.

Key Result

Theorem 2.1

If an algorithm $\mathcal{M}_1$ is $(\epsilon_1, \delta_1)$-DP and $\mathcal{M}_2$ is $(\epsilon_2, \delta_2)$-DP, then their sequential composition $\mathcal{M}(x) = (\mathcal{M}_1(x), \mathcal{M}_2(x))$ is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP.
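
Theorem 2.1, together with the parallel composition theorem (Theorem 2.2), determines how privacy costs combine when a target model borrows blocks from a base model, as in Figure 2. The sketch below is a simplified illustration under stated assumptions. Each model is summarized by a hypothetical `DPGuarantee` record holding its $(\epsilon, \delta)$ and the identifier of the disjoint data partition it was trained on. The extra validation cost ($\epsilon^*_i$ in Figure 2) is ignored. The $\delta$ values are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DPGuarantee:
    """Hypothetical record of a model's (epsilon, delta)-DP guarantee."""
    epsilon: float
    delta: float
    dataset: str  # ID of the disjoint data partition the model was trained on


def compose(base: DPGuarantee, target: DPGuarantee) -> Tuple[float, float]:
    """Guarantee of a target model that reuses base-model blocks (validation cost ignored)."""
    if base.dataset == target.dataset:
        # Sequential composition (Theorem 2.1): both models saw the same data.
        return base.epsilon + target.epsilon, base.delta + target.delta
    # Parallel composition (Theorem 2.2): disjoint partitions, so take the maximum.
    return max(base.epsilon, target.epsilon), max(base.delta, target.delta)


m1 = DPGuarantee(1.0, 1e-5, "D1")  # base model
m2 = DPGuarantee(2.0, 1e-5, "D1")  # same partition as m1
m3 = DPGuarantee(0.4, 1e-5, "D2")  # disjoint partition
print(compose(m1, m2))  # (3.0, 2e-05)
print(compose(m1, m3))  # (1.0, 1e-05)
```

In the Figure 2 setting, borrowing blocks from $M_1$ adds its budget to $M_2$'s because both use D1. $M_3$ and $M_4$ pay only the maximum because their partitions are disjoint from D1.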

Figures (9)

  • Figure 1: Weight disparity between DP-SGD-trained models. Left: RoBERTa fine-tuned on QNLI with $\epsilon=1.0$ vs. $\epsilon=2.0$. Right: RoBERTa fine-tuned on QNLI with $\epsilon=1.0$ vs. fine-tuned on SST2 with $\epsilon=0.4$. The x- and y-axes index the blocks of the two models, respectively, and each point represents the disparity score lee2020fast between the corresponding pair of blocks.
  • Figure 2: Model deduplication example. $M_1$ (the base model), trained on D1, provides blocks that replace similar blocks in each target model $M_i$, subject to privacy and utility bounds (e.g., $\epsilon'_i<\epsilon_i+\epsilon^*_i$). $\epsilon'_2$ follows sequential composition because $M_2$ is also trained on D1, while $\epsilon'_3$ and $\epsilon'_4$ follow parallel composition (see the Preliminaries, Sec. 2). Each dataset (D1, D2, or D3) is a disjoint partition of a logical data collection whose privacy loss should be assessed collectively.
  • Figure 3: System overview. In the second box (B2), M1 and M2 are placed in the same group because they are trained on the same dataset, while M3 is in a separate group (B2-Step 1). Next, the system selects M1 as a base model, since no qualified base model exists for M1 (B2-Step 2). Then M2 is assigned to M1. M3, the unused base model in the other group, is also assigned to M1 (B2-Step 3). The next box (B3) shows how M3 is deduplicated against its assigned base model M1. M3's blocks are first sorted in ascending order of saliency (B3-Step 1). The left half of the blocks is then deduplicated by replacing each block with the most similar block from M1, followed by an accuracy validation. If the accuracy drop is within $0.005$, the right half is deduplicated recursively (B3-Step 2). Otherwise (B3-Step 3), the previous step is rolled back (B3-Step 4), the current group is split into two, and its left half is deduplicated recursively. The recursion stops when fewer than two blocks remain.
  • Figure 4: Compression ratio (C.R.) for individual models in A1 to A5. C.R. of the base model is not affected by deduplication and is thus not shown. A2's base model is the first model in A3.
  • Figure 5: Latency breakdown for serving $100$ inference queries that randomly target multiple models, without and with deduplication (using DRD; latency is shown on a log scale).
  • ...and 4 more figures
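
The B3 steps in the Figure 3 caption describe a recursive, saliency-ordered search with rollback. The sketch below makes these assumptions:

  • Blocks are identified by integer indices.
  • `accuracy_drop` is a hypothetical callback that returns the accuracy drop when the given set of blocks is replaced by their most similar base-model blocks. Block matching itself is not shown.
  • Validations are called directly, not through the SVT-protected validation the paper describes.
  • The names `deduplicate`, `dedup_target`, and the toy saliency values are illustrative.

```python
from typing import Callable, List, Set


def deduplicate(blocks: List[int], accuracy_drop: Callable[[Set[int]], float],
                replaced: Set[int], max_drop: float = 0.005) -> None:
    """Recursively try to replace blocks, keeping a change only if validation passes."""
    # Recursion stops when fewer than two blocks remain in the current group.
    if len(blocks) < 2:
        return
    mid = len(blocks) // 2
    left, right = blocks[:mid], blocks[mid:]
    # Tentatively replace the left half with the most similar base-model blocks.
    replaced.update(left)
    if accuracy_drop(replaced) <= max_drop:
        # Validation passed: keep the left half and continue with the right half.
        deduplicate(right, accuracy_drop, replaced, max_drop)
    else:
        # Validation failed: roll back, then split the failed half and retry its left part.
        replaced.difference_update(left)
        deduplicate(left, accuracy_drop, replaced, max_drop)


def dedup_target(saliency: List[float], accuracy_drop: Callable[[Set[int]], float],
                 max_drop: float = 0.005) -> Set[int]:
    """Order blocks by ascending saliency and return the indices that get replaced."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i])
    replaced: Set[int] = set()
    deduplicate(order, accuracy_drop, replaced, max_drop)
    return replaced


# Toy stand-in for validation: the drop is the sum of saliencies of replaced blocks.
saliency = [0.004, 0.0001, 0.003, 0.0002, 0.0005, 0.002, 0.0003, 0.001]
toy_drop = lambda s: sum(saliency[i] for i in s)
print(sorted(dedup_target(saliency, toy_drop)))  # [1, 3, 4, 5, 6, 7]
```

The sketch follows the caption literally, so a group of a single block is never tried on its own. In the real system each validation touches private data, which is why the paper schedules validations and applies SVT rather than calling an unprotected callback as done here.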

Theorems & Definitions (6)

  • Definition 2.1: Differential Privacy Dwork2014DP
  • Theorem 2.1: Sequential Composition Dwork2014DP
  • Theorem 2.2: Parallel Composition mcsherry2009privacy
  • Theorem 2.3: Post-processing Dwork2014DP
  • Definition 2.2: Sensitivity Dwork2014DP
  • Theorem 4.1