An Optimization Framework for Differentially Private Sparse Fine-Tuning

Mehdi Makni, Kayhan Behdin, Gabriel Afriat, Zheng Xu, Sergei Vassilvitskii, Natalia Ponomareva, Hussein Hazimeh, Rahul Mazumder

TL;DR

The paper tackles the utility gap in fine-tuning large pre-trained models under differential privacy by proposing SPARTA, an optimization-based framework that jointly selects a sparse subnetwork and fine-tunes its weights using private gradient information. SPARTA privately estimates group-level gradient signals, scores each group with a Taylor-based approximation, and selects the top-k groups; the privacy cost of this selection is accounted for in a way that matches the cost of one DP-SGD epoch. Empirical results on vision tasks show that SPARTA outperforms full-model DP-SGD and existing private sparse fine-tuning methods, while a row-grouping strategy also enables hardware speedups. The result is a practical and scalable way to obtain higher utility from sparse DP fine-tuning on large architectures.
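As a rough illustration of the selection step described above, the sketch below scores parameter groups from an already-privatized gradient and keeps the top-k groups. The squared-norm score, the function name `select_top_k_groups`, and the `(num_groups, group_size)` array layout are assumptions made for illustration; the paper's actual score relies on a Taylor-based approximation that is not reproduced here.

```python
import numpy as np

def select_top_k_groups(noisy_grad, k):
    """Return a boolean mask over groups and the per-group scores.

    noisy_grad: array of shape (num_groups, group_size) holding a gradient
    estimate that is assumed to be privatized already (clipped and noised).
    """
    num_groups = noisy_grad.shape[0]
    if not 1 <= k <= num_groups:
        raise ValueError("k must satisfy 1 <= k <= num_groups")
    # Hypothetical score: squared L2 norm of each group's noisy gradient.
    scores = np.sum(noisy_grad ** 2, axis=1)
    # Indices of the k highest-scoring groups (their relative order is irrelevant).
    top = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(num_groups, dtype=bool)
    mask[top] = True
    return mask, scores

# Toy example with 4 groups of 2 weights each.
noisy_grad = np.array([[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, 1.5]])
mask, scores = select_top_k_groups(noisy_grad, k=2)
```

Under the stated assumption that `noisy_grad` is already private, the selection only post-processes a privatized quantity, so it would not consume additional privacy budget.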

Abstract

Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain, most notably the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights while keeping the rest fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse fine-tuning often used a fixed choice of trainable weights (e.g., updating only the last layer), or relied on the public model's weights to choose the subset of weights to modify. Such choices of weights remain suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information while using off-the-shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy than full-model private fine-tuning or existing private sparse fine-tuning approaches.
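To make the setting concrete, here is a minimal NumPy sketch of one DP-SGD step (per-example clipping followed by Gaussian noise) restricted to a fixed subset of trainable weights. The function name `sparse_dp_sgd_step`, the flat weight layout, and the choice to apply the mask before clipping are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sparse_dp_sgd_step(weights, per_example_grads, mask, lr,
                       clip_norm, noise_multiplier, rng):
    """One DP-SGD update that only touches coordinates where mask is True.

    weights: shape (d,); per_example_grads: shape (batch, d); mask: shape (d,).
    """
    # Frozen coordinates contribute nothing to the clipped norm (assumption).
    g = per_example_grads * mask
    # Clip each example's gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Gaussian noise with standard deviation noise_multiplier * clip_norm,
    # added only on the trainable coordinates.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape) * mask
    noisy_mean = (g.sum(axis=0) + noise) / g.shape[0]
    return weights - lr * noisy_mean

rng = np.random.default_rng(0)
w0 = np.zeros(4)
grads = np.array([[3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]])
trainable = np.array([True, True, False, False])
# noise_multiplier=0.0 disables the noise so the arithmetic is easy to follow;
# such a step is NOT private and is shown for illustration only.
w1 = sparse_dp_sgd_step(w0, grads, trainable, lr=1.0,
                        clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

Masking before clipping means the clipping norm is computed over fewer coordinates, which is one intuition for why training a small subset of weights can help under DP.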

Paper Structure

This paper contains 19 sections, 2 theorems, 19 equations, 4 figures, 2 tables, 4 algorithms.

Key Result

Proposition 4.1

The vector $\tilde{\boldsymbol u}^t$ is a sampled Gaussian mechanism (SGM) as defined in Definition 2.2.

Figures (4)

  • Figure 1: Row-grouping operation on a 2D convolutional layer.
  • Figure 2: Accuracy versus percentage of trainable parameters for ResNet18 under $(\varepsilon, \delta) = (1, 10^{-5})$ DP guarantees.
  • Figure 3: Efficient implementation of sparse DP-SGD fine-tuning with our proposed row-grouping scheme.
  • Figure 4: Accuracy versus percentage of trainable parameters for DeiT Tiny under $(\varepsilon, \delta) = (2, 10^{-5})$ (left) and $(\varepsilon, \delta) = (8, 10^{-5})$ (right) DP guarantees.
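The row-grouping idea named in Figures 1 and 3 can be sketched as follows. This assumes that a "row" is one output channel of the convolution weight flattened to a vector, which is a guess based on the caption rather than a detail confirmed here; `row_group_mask` is a hypothetical helper.

```python
import numpy as np

def row_group_mask(weight_shape, selected_rows):
    """Build a trainability mask for a conv weight of shape (c_out, c_in, kh, kw).

    Each output channel is treated as one group (assumption): the weight is
    viewed as a (c_out, c_in * kh * kw) matrix and whole rows are switched on.
    """
    c_out = weight_shape[0]
    row_len = int(np.prod(weight_shape[1:]))
    mask2d = np.zeros((c_out, row_len), dtype=bool)
    mask2d[list(selected_rows), :] = True
    return mask2d.reshape(weight_shape)

# 8 output channels, 4 input channels, 3x3 kernels; train rows 1 and 5 only.
mask = row_group_mask((8, 4, 3, 3), [1, 5])
trainable_pct = 100.0 * mask.mean()  # share of trainable parameters in this layer
```

Because entire rows are either trainable or frozen, gradients only need to be computed for a smaller dense sub-matrix, which is the kind of structure that can translate into hardware speedups.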

Theorems & Definitions (6)

  • Definition 2.1: DP [dwork2006differential, abadi2016deep]
  • Definition 2.2: SGM [rdp1]
  • Remark 3.1: Differences from neural network pruning
  • Example 3.2
  • Proposition 4.1
  • Proposition 4.2
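
For reference, the two definitions listed above are usually stated as follows in the DP literature. The notation ($\mathcal{M}$, $D$, $D'$, $S$, $f$, $q$, $\sigma$) is an assumption and may differ from the paper's own.

```latex
% (\varepsilon, \delta)-differential privacy of a randomized mechanism \mathcal{M}
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
\quad \text{for all neighboring datasets } D, D' \text{ and all measurable sets } S.

% Sampled Gaussian mechanism (SGM) with sampling rate q and noise scale \sigma,
% where f maps subsets of the data to \mathbb{R}^d
\mathrm{SG}_{q,\sigma}(D) = f\bigl(\{x : x \in D \text{ is sampled with probability } q\}\bigr)
  + \mathcal{N}(0, \sigma^2 \mathbb{I}_d)
```

Showing that a released quantity is an SGM, as Proposition 4.1 does for $\tilde{\boldsymbol u}^t$, is what lets standard accounting tools for the SGM be reused without a custom privacy analysis.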