Differentially Private Optimization for Non-Decomposable Objective Functions

Weiwei Kong, Andrés Muñoz Medina, Mónica Ribero

TL;DR

This work addresses private training for non-decomposable, similarity-based losses used in unsupervised pre-training by introducing a DP-SGD variant, Logit-DP, that clips pairwise logit gradients and adds Gaussian noise, yielding a private gradient whose sensitivity is independent of the batch size. By decomposing the gradient into pairwise similarity components, the authors obtain a bound on the $L_2$-sensitivity of $\nabla \mathcal{L}$ that depends on the constants $G_1$, $G_2$, and $L$ but does not scale with the batch size $n$. Empirically, Logit-DP yields performance close to non-private training on CIFAR-10 pretraining and CIFAR-100 finetuning, outperforming Naive-DP approaches that suffer from excessive noise at larger batch sizes. The work advances privacy-preserving foundations for large-scale unsupervised learning and provides practical guidance for balancing privacy budgets with high-utility representations in vision and language models.
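
The clip-then-noise idea mentioned above can be illustrated with a minimal, generic sketch. It is not the paper's Logit-DP algorithm (which clips pairwise logit gradients); it only shows the standard step of clipping each vector to a norm bound and adding Gaussian noise to the sum. The function names, the `noise_multiplier` parameter, and the toy vectors are assumptions made for illustration.

```python
import math
import random

def clip_to_norm(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def noisy_clipped_sum(vectors, max_norm, noise_multiplier, rng):
    """Sum the clipped vectors, then add i.i.d. Gaussian noise per coordinate.

    The noise standard deviation is noise_multiplier * max_norm, a common
    convention; it is an illustrative assumption here.
    """
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec in vectors:
        for k, x in enumerate(clip_to_norm(vec, max_norm)):
            total[k] += x
    std = noise_multiplier * max_norm
    return [t + rng.gauss(0.0, std) for t in total]

rng = random.Random(0)                     # fixed seed for repeatability
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # made-up toy vectors
clipped = [clip_to_norm(g, 1.0) for g in grads]
noiseless = noisy_clipped_sum(grads, 1.0, 0.0, rng)  # zero noise: plain clipped sum
private = noisy_clipped_sum(grads, 1.0, 1.0, rng)    # clipped sum plus noise
```

When the summed terms are independent per example, replacing one vector moves the clipped sum by at most twice the clipping norm, regardless of how many vectors are summed. The difficulty the summary describes is that a contrastive loss does not split into such independent terms.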

Abstract

Unsupervised pre-training is a common step in developing computer vision models and large language models. In this setting, the absence of labels requires the use of similarity-based loss functions, such as contrastive loss, that favor minimizing the distance between similar inputs and maximizing the distance between distinct inputs. As privacy concerns mount, training these models using differential privacy has become more important. However, due to how inputs are generated for these losses, one of their undesirable properties is that their $L_2$ sensitivity grows with the batch size. This property is particularly disadvantageous for differentially private training methods, such as DP-SGD. To overcome this issue, we develop a new DP-SGD variant for similarity-based loss functions -- in particular, the commonly used contrastive loss -- that manipulates gradients of the objective function in a novel way to obtain a sensitivity of the summed gradient that is $O(1)$ for batch size $n$. We test our DP-SGD variant on some CIFAR-10 pre-training and CIFAR-100 finetuning tasks and show that, in both tasks, our method's performance comes close to that of a non-private model and generally outperforms DP-SGD applied directly to the contrastive loss.
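
To make the non-decomposability concrete, the sketch below computes per-example terms of a common InfoNCE-style contrastive loss and checks how many terms change when a single example pair in the batch is replaced. The exact loss form, the cosine similarity, the temperature value, and the toy vectors are assumptions for illustration; the paper's own definition (Definition 2.2) is not reproduced on this page and may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_terms(anchors, positives, temperature=0.5):
    """Per-example terms: -log softmax over similarities to every positive."""
    n = len(anchors)
    terms = []
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        terms.append(log_norm - logits[i])
    return terms

# Made-up toy batch of three (anchor, positive) pairs.
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]]
base = contrastive_terms(anchors, positives)

# Replace only the last pair; the first two pairs are untouched.
anchors2 = anchors[:2] + [[-1.0, 0.5]]
positives2 = positives[:2] + [[-0.9, 0.6]]
changed = contrastive_terms(anchors2, positives2)

# Count the per-example terms affected by the single replacement.
num_changed = sum(abs(a - b) > 1e-9 for a, b in zip(base, changed))
```

In this toy batch all three terms change, even though only one pair was replaced, because every term's normalizer includes the replaced example. This is the mechanism behind the sensitivity growing with batch size when DP-SGD is applied directly to such a loss.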

Paper Structure

This paper contains 19 sections, 6 theorems, 35 equations, 6 figures, 4 tables, and 1 algorithm.

Key Result

Lemma 4.1

Let $\mathcal{L}_X(w)$ and $Z_X^{(i,n)}(w)$ be as in eq:similarity-loss, and denote Then,

Figures (6)

  • Figure 1: (Left) Relative CIFAR10 training loss over ten runs. Relative loss is defined as the observed training loss divided by the minimum loss observed across all runs and all variants. Shaded regions bound the observed loss values over the runs, while the dark lines represent the average relative loss observed so far. (Right) Single runs of Naive-DP with the same settings as in the left graph but with different batch sizes $n$. The $n=1000$ and $n=10000$ runs form mostly overlapping lines.
  • Figure 2: Relative CIFAR100 training loss for a single run. Relative loss is defined as the observed training loss divided by the minimum loss observed across all variants. Lightly colored lines are the true loss values, while the dark lines are smoothed loss values generated by a third-order Savitzky-Golay filter with a sliding window of 100 observations.
  • Figure 3: Averaged CIFAR10 confusion matrices at the last testing step for the generic embedding net experiments. Values are rounded down to the nearest whole number.
  • Figure 4: Averaged CIFAR10 confusion matrices at the last testing step for the ResNet18 experiments. Values are rounded down to the nearest whole number.
  • Figure 5: Training-time plots for the small embedding net model on CIFAR10 over ten runs. (Left) Number of seconds per example over the number of examples seen. Shaded regions bound the observed values, while the dark lines represent the averaged values. (Right) Average training losses over the average runtime.
  • ...and 1 more figure
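
The relative loss used in the captions of Figures 1 and 2 (observed loss divided by the minimum loss across all runs and all variants) can be sketched as follows. The function name, the data layout, and the loss values are made up for illustration; only the variant names come from the text.

```python
def relative_losses(losses_by_variant):
    """Divide every loss by the global minimum over all variants and runs."""
    global_min = min(
        loss
        for runs in losses_by_variant.values()
        for run in runs
        for loss in run
    )
    return {
        variant: [[loss / global_min for loss in run] for run in runs]
        for variant, runs in losses_by_variant.items()
    }

# Hypothetical loss curves: two runs per variant, three steps per run.
losses = {
    "Non-private": [[4.0, 2.0, 1.0], [4.0, 2.5, 1.25]],
    "Logit-DP": [[4.0, 3.0, 2.0], [4.0, 3.5, 2.5]],
    "Naive-DP": [[4.0, 3.8, 3.6], [4.0, 3.9, 3.7]],
}
relative = relative_losses(losses)
```

Under this normalization the best observed loss maps to 1, so curves from different variants share one scale.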

Theorems & Definitions (15)

  • Definition 2.1: Gaussian Mechanism
  • Definition 2.2: Canonical contrastive loss
  • Definition 2.3: Spreadout regularizer loss
  • Definition 2.4: Summed loss from per-example loss
  • Lemma 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Corollary 4.4
  • proof
  • Lemma 4.5
  • ...and 5 more