Differentially Private Representation Learning via Image Captioning

Tom Sander; Yaodong Yu; Maziar Sanjabi; Alain Durmus; Yi Ma; Kamalika Chaudhuri; Chuan Guo

TL;DR

This work trains a DP image captioner (DP-Cap) from scratch on a 233M-sample subset of LAION-2B using a reasonable amount of computation, and obtains image features of unprecedented quality that can be used in a variety of downstream vision and vision-language tasks.

Abstract

Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) from scratch on a 233M-sample subset of LAION-2B using a reasonable amount of computation, and obtain image features of unprecedented quality that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$ for the LAION dataset, a linear classifier trained on top of learned DP-Cap features attains $65.8\%$ accuracy on ImageNet-1K, a considerable improvement over the previous SOTA of $56.5\%$.

Paper Structure (24 sections, 5 equations, 8 figures, 13 tables)

This paper contains 24 sections, 5 equations, 8 figures, 13 tables.

Introduction
Background and Related Work
Approach
DP Representation Learning via Image Captioning
Strategy for Effective DP Training
Evaluation
Downstream Tasks
Experimental Setup
Main Results
Ablation Studies
Discussion and Future Work
Implementation Details
Training Details
Computation cost
Mixed Precision Package & Ghost Norm
...and 9 more sections

Figures (8)

Figure 1: (a) Few-shot ImageNet-1K linear probe accuracy comparison between DP-Cap (ours) and ViP (Yu et al., 2023), the previous SOTA. DP-Cap learns better image representations using the same training data and privacy budget, and considerably surpasses synthetic initialization (syn). The privacy budget $\varepsilon$ is for the LAION dataset, and the linear classifiers are trained without DP. (b) Compositional understanding evaluation on the ARO benchmark (Yuksekgonul et al., 2022). DP-Cap performance is close to non-private Cap and outperforms non-private CLIP. (c) Captions generated by DP-Cap on images from the MS-COCO 2017 (Lin et al., 2015) test set.
Figure 2: Impact of synthetic initialization on DP-Cap. The learned image representation benefits substantially from initializing on the Shaders21K dataset. The gap between DP-Cap (random init) and DP-Cap (syn init) can be as large as 24% when evaluated using linear probing on ImageNet.
Figure 3: (a) We fix the effective noise $\sigma/B = 5.6 \times 10^{-7}$ (corresponding to our (B, $\sigma$) = (1.3M, 0.728)) and show that the loss is remarkably consistent across different batch sizes, allowing us to effectively scale up the batch size to improve the SNR. (b) Performance for 4 sets of parameters that each provide $\varepsilon=8$ with a constant number of steps (5708), ranging from a batch size of 98k (used in ViP (Yu et al., 2023)) to our 1.3M batch size. In contrast to ViP, DP-Cap successfully leverages the better SNR and learns features that achieve substantially better 10-shot accuracy on ImageNet, even compared to a non-private MAE (He et al., 2022) trained on the same dataset (see the appendix).
Figure 4: Number of GPU hours to train DP-Cap for a single epoch on 233M samples. For the Large model, we achieve a close to $5\times$ reduction.
Figure 5: At fixed (B, $\sigma$, S), $\varepsilon$ decreases drastically as the dataset size increases.
...and 3 more figures
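
The arithmetic behind the Figure 3 and Figure 5 captions can be checked with a short sketch. It assumes that "1.3M" and "233M" mean roughly 1.3e6 and 233e6, and that the effective noise is simply the noise multiplier divided by the batch size, as the caption's $\sigma/B$ suggests. The function names are hypothetical and are not taken from the paper's code. The sketch does not compute $\varepsilon$; it only shows the sampling rate $B/N$, which shrinks as the dataset grows and is the quantity that privacy amplification by subsampling relies on.

```python
# Illustrative arithmetic only; all helper names below are hypothetical.

def effective_noise(sigma: float, batch_size: float) -> float:
    """Noise multiplier divided by batch size (sigma / B in Figure 3)."""
    return sigma / batch_size


def sigma_for_batch(target_effective_noise: float, batch_size: float) -> float:
    """Noise multiplier needed to keep sigma / B fixed at a given batch size."""
    return target_effective_noise * batch_size


def sampling_rate(batch_size: float, dataset_size: float) -> float:
    """Fraction of the dataset drawn per step (B / N)."""
    return batch_size / dataset_size


# Values quoted in the captions, rounded as in the text (assumed approximate).
SIGMA = 0.728
BATCH_SIZE = 1.3e6
DATASET_SIZE = 233e6

ratio = effective_noise(SIGMA, BATCH_SIZE)
print(f"sigma / B = {ratio:.2e}")

# Figure 3(a): hold sigma / B fixed while varying the batch size.
for b in (98e3, 400e3, 1.3e6):
    print(f"B = {b:>9.0f} -> sigma = {sigma_for_batch(ratio, b):.4f}")

# Figure 5: at fixed B, a larger dataset means a smaller sampling rate.
for n in (1e6 * 10, 1e6 * 100, DATASET_SIZE):
    print(f"N = {n:>11.0f} -> B / N = {sampling_rate(BATCH_SIZE, n):.4f}")
```

The batch sizes and dataset sizes in the two loops, other than 98k, 1.3M and 233M, are arbitrary illustration points rather than settings reported in the paper.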
