Safeguarding Data in Multimodal AI: A Differentially Private Approach to CLIP Training

Alyssa Huang; Peihan Liu; Ryumei Nakada; Linjun Zhang; Wanrong Zhang

TL;DR

This work introduces a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy, and analyzes the proposed algorithm under linear representation settings.

Abstract

The surge in multimodal AI's success has sparked concerns over data privacy in vision-and-language tasks. While CLIP has revolutionized multimodal learning through joint training on images and text, its potential to unintentionally disclose sensitive information necessitates the integration of privacy-preserving mechanisms. We introduce a differentially private adaptation of the Contrastive Language-Image Pretraining (CLIP) model that effectively addresses privacy concerns while retaining accuracy. Our proposed method, Dp-CLIP, is rigorously evaluated on benchmark datasets encompassing diverse vision-and-language tasks such as image classification and visual question answering. We demonstrate that our approach retains performance on par with the standard non-private CLIP model. Furthermore, we analyze our proposed algorithm under linear representation settings. We derive the convergence rate of our algorithm and show a trade-off between utility and privacy when gradients are clipped per-batch and the loss function does not satisfy smoothness conditions assumed in the literature for the analysis of DP-SGD.

Paper Structure (28 sections, 16 theorems, 143 equations, 1 figure, 9 tables, 1 algorithm)

Introduction
Preliminaries
Dp-CLIP: Private and Accurate Representations
Experiments
Experiments Setup
Datasets
Model Architecture
Training Details
Image Classification Results
Image Captioning Results
Visual Question Answering Results
Theoretical Analysis of Dp-CLIP
Privacy-utility Trade-off of Dp-CLIP
Proof Outline of Theorem 5.4
Discussion
...and 13 more sections

Key Result

Proposition 3.1

Choose $b < n/10$. There exist universal constants $C_\epsilon, C_\sigma > 0$ such that for any $\epsilon \leq C_\epsilon b^2 T/n^2$ and $\delta > 0$, Dp-CLIP is $(\epsilon,\delta)$-differentially private if we choose $\sigma \geq C_\sigma \sqrt{T \log(1/\delta)} / (n \epsilon)$.
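
To make the scaling in Proposition 3.1 concrete, the following is a minimal sketch that evaluates the two bounds numerically. It assumes the usual reading of the symbols, which this page does not spell out: $n$ as the number of training examples, $b$ as the batch size, and $T$ as the number of iterations. The universal constants $C_\epsilon$ and $C_\sigma$ are not given here, so the placeholder value 1.0 is purely hypothetical, as are the function names. The restriction $\delta < 1$ is an added assumption that keeps $\log(1/\delta)$ positive.

```python
import math


def epsilon_limit(n, b, T, c_eps=1.0):
    """Largest epsilon covered by the proposition: C_eps * b^2 * T / n^2.

    c_eps stands in for the unspecified universal constant C_eps
    (hypothetical placeholder value).
    """
    if not 0 < b < n / 10:
        raise ValueError("Proposition 3.1 assumes b < n/10")
    return c_eps * b ** 2 * T / n ** 2


def min_noise_scale(n, T, eps, delta, c_sigma=1.0):
    """Smallest sigma allowed: C_sigma * sqrt(T * log(1/delta)) / (n * eps).

    c_sigma stands in for the unspecified universal constant C_sigma
    (hypothetical placeholder value).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < delta < 1:
        raise ValueError("this sketch assumes 0 < delta < 1")
    return c_sigma * math.sqrt(T * math.log(1.0 / delta)) / (n * eps)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    n, b, T = 50_000, 512, 10_000
    eps_max = epsilon_limit(n, b, T)
    eps = min(1.0, eps_max)
    sigma = min_noise_scale(n, T, eps, delta=1e-5)
    print(f"eps limit = {eps_max:.4f}, chosen eps = {eps:.4f}, sigma >= {sigma:.6f}")
```

Under these assumptions the bound behaves as the formula suggests: the required noise scale shrinks as $n$ or $\epsilon$ grows, and grows with $\sqrt{T}$ and with $\sqrt{\log(1/\delta)}$.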

Figures (1)

Figure 1: CLIP pretraining process from Radford et al. (2021) (left) and structure of BLIP for VQA from Li et al. (2022) (right)

Theorems & Definitions (32)

Definition 2.1: Differential Privacy (Dwork et al., 2006)
Proposition 3.1
Remark 3.2
Theorem 5.4: Privacy-utility Trade-off
Remark 5.5
Remark 5.6
Definition 3.1: Gaussian Mechanism
Definition 3.2: $\ell_2$-sensitivity
Proposition 3.3
Proof
...and 22 more
