Differentially Private Optimization for Non-Decomposable Objective Functions
Weiwei Kong, Andrés Muñoz Medina, Mónica Ribero
TL;DR
This work addresses private training for non-decomposable, similarity-based losses used in unsupervised pre-training by introducing a DP-SGD variant, Logit-DP, that clips pairwise logit gradients and adds Gaussian noise, yielding a private gradient whose sensitivity is independent of the batch size. By decomposing the gradient into pairwise similarity components, the authors obtain a bound on the $L_2$-sensitivity of $\nabla \text{L}$ that depends on the constants $G_1$, $G_2$, and $L$ but does not grow with the batch size $n$. Empirically, Logit-DP yields performance close to non-private training on CIFAR-10 pre-training and CIFAR-100 finetuning, outperforming Naive-DP approaches that suffer from excessive noise at larger batch sizes. The work advances privacy-preserving foundations for large-scale unsupervised learning and provides practical guidance for balancing privacy budgets with high-utility representations in vision and language models.
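The "clip, sum, add Gaussian noise" pattern mentioned above can be sketched generically as follows. This is a minimal illustration and not the authors' Logit-DP algorithm: the function names `clip` and `noisy_clipped_sum` are hypothetical, the vectors stand in for whatever quantities are clipped (pairwise logit gradients in the paper's setting), and calibrating `noise_std` to a sensitivity bound and privacy budget is assumed to be done by the caller.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def noisy_clipped_sum(vectors, max_norm, noise_std, rng):
    """Clip each vector, sum the results, and add independent
    N(0, noise_std^2) noise to every coordinate of the sum.

    noise_std is assumed to be calibrated by the caller to the
    sensitivity of the clipped sum (hypothetical interface).
    """
    dim = len(vectors[0])
    total = [0.0] * dim
    for vec in vectors:
        for k, v in enumerate(clip(vec, max_norm)):
            total[k] += v
    return [t + rng.gauss(0.0, noise_std) for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(noisy_clipped_sum(grads, max_norm=1.0, noise_std=0.1, rng=rng))
```

If the sensitivity of the summed quantity does not depend on $n$, the same `noise_std` can be used for any batch size, so the noise in the averaged gradient shrinks as the batch grows.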
Abstract
Unsupervised pre-training is a common step in developing computer vision models and large language models. In this setting, the absence of labels requires the use of similarity-based loss functions, such as contrastive loss, that favor minimizing the distance between similar inputs and maximizing the distance between distinct inputs. As privacy concerns mount, training these models using differential privacy has become more important. However, due to how inputs are generated for these losses, one of their undesirable properties is that their $L_2$-sensitivity grows with the batch size. This property is particularly disadvantageous for differentially private training methods, such as DP-SGD. To overcome this issue, we develop a new DP-SGD variant for similarity-based loss functions (in particular, the commonly used contrastive loss) that manipulates gradients of the objective function in a novel way to obtain a sensitivity of the summed gradient that is $O(1)$ for batch size $n$. We test our DP-SGD variant on CIFAR-10 pre-training and CIFAR-100 finetuning tasks and show that, in both tasks, our method's performance comes close to that of a non-private model and generally outperforms DP-SGD applied directly to the contrastive loss.
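To see why such a loss is non-decomposable, the toy sketch below computes per-example contrastive terms and counts how many of them change when a single input pair in the batch is replaced. The specific loss form (a softmax over dot-product similarities), the embeddings, and the names `contrastive_terms` and `count_changed_terms` are assumptions made for illustration and are not taken from the paper.

```python
import math


def contrastive_terms(anchors, positives):
    """Per-example terms l_i = log(sum_j exp(s_ij)) - s_ii,
    with s_ij the dot product of anchors[i] and positives[j]."""
    terms = []
    for i, a in enumerate(anchors):
        logits = [sum(u * v for u, v in zip(a, p)) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        terms.append(log_z - logits[i])
    return terms


def count_changed_terms(n):
    """Replace the first pair in a batch of n toy pairs and count
    how many per-example loss terms differ afterwards."""
    anchors = [[math.cos(i), math.sin(i)] for i in range(n)]
    positives = [[math.cos(i + 0.1), math.sin(i + 0.1)] for i in range(n)]
    before = contrastive_terms(anchors, positives)

    # Neighboring batch: only pair 0 is swapped for a different pair.
    anchors_new = [[2.0, 0.5]] + anchors[1:]
    positives_new = [[0.5, 2.0]] + positives[1:]
    after = contrastive_terms(anchors_new, positives_new)

    return sum(abs(b - a) > 1e-9 for b, a in zip(before, after))


if __name__ == "__main__":
    for n in (4, 8):
        print(n, count_changed_terms(n))
```

Because every term uses the replaced input as a negative, all $n$ terms change. Under per-example clipping, a single replacement can therefore alter up to $n$ clipped gradients, so in the worst case the summed gradient moves by an amount on the order of $n$ times the clipping norm.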
