Differentially Private Representation Learning via Image Captioning

Tom Sander, Yaodong Yu, Maziar Sanjabi, Alain Durmus, Yi Ma, Kamalika Chaudhuri, Chuan Guo

TL;DR

This work successfully trains a DP image captioner (DP-Cap) from scratch on a 233M subset of LAION-2B using a reasonable amount of computation, and obtains unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks.

Abstract

Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) from scratch on a 233M subset of LAION-2B using a reasonable amount of computation, and obtain unprecedented high-quality image features that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$ for the LAION dataset, a linear classifier trained on top of learned DP-Cap features attains $65.8\%$ accuracy on ImageNet-1K, considerably improving on the previous SOTA of $56.5\%$.

Paper Structure (24 sections, 5 equations, 8 figures, 13 tables)

This paper contains 24 sections, 5 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: (a) Few-shot ImageNet-1K linear probe accuracy comparison between DP-Cap (ours) and ViP (Yu et al., 2023; previous SOTA). DP-Cap learns better image representations using the same training data and privacy budget, and considerably surpasses synthetic initialization (syn). The privacy budget $\varepsilon$ is for the LAION dataset, and the linear classifiers are trained without DP. (b) Compositional understanding evaluation on the ARO benchmark (Yuksekgonul et al., 2022). DP-Cap performance is close to non-private Cap and outperforms non-private CLIP. (c) Captions generated by DP-Cap on images from the MS-COCO 2017 (Lin et al., 2015) test set.
  • Figure 2: Impact of synthetic initialization on DP-Cap. The learned image representation benefits substantially from initializing on the Shaders21K dataset. The gap between DP-Cap (random init) and DP-Cap (syn init) can be as large as 24% when evaluated using linear probing on ImageNet.
  • Figure 3: (a) We fix the effective noise $\sigma/B = 5.6 \times 10^{-7}$ (corresponding to our $(B, \sigma) = (1.3\text{M}, 0.728)$) and show that the loss is remarkably consistent across different batch sizes, allowing us to effectively scale up the batch size to improve the signal-to-noise ratio (SNR). (b) Performance for 4 sets of parameters that provide $\varepsilon=8$ with a constant number of steps (5708), ranging from batch size 98k (used in ViP; Yu et al., 2023) to our 1.3M batch size. In contrast to ViP, DP-Cap successfully leverages the better SNR and learns features that achieve substantially better 10-shot accuracy on ImageNet, even compared to a non-private MAE (He et al., 2022) trained on the same dataset (see the appendix on optimization).
  • Figure 4: Number of GPU hours to train DP-Cap for a single epoch on 233M samples. For the Large model, we achieve a close to $5\times$ reduction.
  • Figure 5: At fixed $(B, \sigma, S)$, $\varepsilon$ decreases drastically as the dataset size grows.
  • ...and 3 more figures
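
The effective-noise figure quoted in the Figure 3 caption can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes a generic DP-SGD-style aggregation (clip each per-sample gradient, sum, add Gaussian noise, divide by the batch size) and a clipping norm of 1.0, neither of which is stated in the text above. All function names are hypothetical, and "1.3M" is treated as exactly 1,300,000. It says nothing about $\varepsilon$, which requires a privacy accountant.

```python
import math
import random

# Values quoted in the abstract and the Figure 3 caption ("1.3M" is rounded there).
BATCH_SIZE = 1_300_000      # B
SIGMA = 0.728               # noise multiplier sigma
NUM_STEPS = 5708            # number of training steps
DATASET_SIZE = 233_000_000  # "233M subset of LAION-2B"


def effective_noise(sigma, batch_size):
    """Noise scale on the averaged gradient, sigma / B (clipping norm 1)."""
    return sigma / batch_size


def sigma_for_batch(target_effective_noise, batch_size):
    """Noise multiplier that keeps sigma / B fixed at a different batch size."""
    return target_effective_noise * batch_size


def approx_epochs(num_steps, batch_size, dataset_size):
    """Rough number of passes over the data implied by steps and batch size."""
    return num_steps * batch_size / dataset_size


def clip(grad, max_norm):
    """Rescale one per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_sample_grads, sigma, max_norm=1.0, rng=random):
    """Generic DP-SGD-style aggregation (assumed, not taken from the paper).

    Noise with std sigma * max_norm is added to the *sum* of clipped
    gradients, so after dividing by B its std is sigma * max_norm / B.
    """
    batch_size = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    return [(s + rng.gauss(0.0, sigma * max_norm)) / batch_size for s in summed]


if __name__ == "__main__":
    print("effective noise:", effective_noise(SIGMA, BATCH_SIZE))
    print("approx. epochs:", approx_epochs(NUM_STEPS, BATCH_SIZE, DATASET_SIZE))
    print("sigma at B=98k for the same effective noise:",
          sigma_for_batch(effective_noise(SIGMA, BATCH_SIZE), 98_000))
```

Because the noise is added once per batch rather than once per sample, its contribution to the averaged gradient shrinks as $1/B$. This is why, under these assumptions, a larger batch at a fixed noise multiplier gives a better signal-to-noise ratio.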