Déjà Vu Memorization in Vision-Language Models

Bargav Jayaraman, Chuan Guo, Kamalika Chaudhuri

TL;DR

We propose déjà vu memorization, a new method for measuring memorization in VLMs, and show that text randomization considerably mitigates memorization while only moderately impacting the model's downstream task performance.

Abstract

Vision-Language Models (VLMs) have emerged as the state-of-the-art representation learning solution, with myriads of downstream applications such as image classification, retrieval and generation. A natural question is whether these models memorize their training data, which also has implications for generalization. We propose a new method for measuring memorization in VLMs, which we call déjà vu memorization. For VLMs trained on image-caption pairs, we show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption. We evaluate déjà vu memorization at both sample and population level, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs. Finally, we show that text randomization considerably mitigates memorization while only moderately impacting the model's downstream task performance.

Paper Structure (40 sections, 4 equations, 12 figures, 1 algorithm)

Figures (12)

  • Figure 1: An example where a CLIP (Radford et al., 2021) model trained on a 40M subset of a Shutterstock data set exhibits déjà vu memorization of objects present in a training image. The public set is a separate collection of 20M images from Shutterstock that has no overlap with the training set. Objects annotated in orange are true positives, i.e., objects present in the target image, and objects annotated in blue are false positives. Our test recovers significantly more memorized objects for the target VLM (trained on the target image) than for the reference VLM (not trained on the target image). Additional qualitative examples can be found in the appendix.
  • Figure 2: Utility and déjà vu memorization of ViT-B-32 CLIP models with varying training set sizes. Model utility is quantified in terms of ImageNet zero-shot accuracy. Population-level memorization is measured using the metrics defined in the metrics section, over three choices of public set. (a) The training set is sampled from filtered LAION and ImageNet is used as the public set. (b) The training set is sampled from filtered LAION and a holdout filtered LAION-50M set is used as the public set. (c) The training set is sampled from Shutterstock and a holdout SS-20M set is used as the public set. For the memorization metrics, we report the mean $\pm$ std values (std $\le$ 0.003) over 100 repetitions of randomly sampling 10% of records with replacement.
  • Figure 3: Object recall distribution of target and reference models trained on the filtered LAION data set for 200 epochs with different training set sizes. ImageNet is used as the public set for the kNN test.
  • Figure 4: Sample-level memorization gap between target and reference models when predicting the top-10 objects for different top-$L$ records. Models are trained on disjoint 10M subsets of the filtered LAION data set for 200 epochs, and the ImageNet public set is used for the kNN test. The model exhibits very strong déjà vu memorization on a small subset of samples, as indicated by the large precision/recall/F-score gaps when $L$ is small.
  • Figure 5: Effect of mitigation on ViT-B-32 OpenCLIP models trained on a 10M subset of filtered LAION. Memorization is evaluated using ImageNet as the public set. The default setting is marked with an asterisk. For the memorization metrics, we report the mean $\pm$ std values (std $\le$ 0.003) over 100 repetitions of randomly sampling 10% of records with replacement. Among these mitigations, text masking offers the best trade-off, reducing memorization without sacrificing utility.
  • ...and 7 more figures
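
The captions above refer to precision/recall/F-score gaps between a target and a reference model, and to mean $\pm$ std values obtained by repeatedly sampling 10% of records with replacement. The following is a minimal sketch of how such numbers could be computed, not the authors' implementation. The record layout (`truth`, `target`, `reference` as sets of object names), the function names, and the choice of the mean per-record recall gap as the summary statistic are assumptions made for illustration; the paper's actual population-level metrics are defined in its metrics section, which is not reproduced here.

```python
import random
import statistics


def prf(predicted, truth):
    """Precision, recall and F-score of a predicted object set against the true objects."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def recall_gaps(records):
    """Per-record recall gap: target-model recall minus reference-model recall.

    Each record is a dict with hypothetical keys 'truth', 'target' and
    'reference', each holding a set of object names.
    """
    gaps = []
    for record in records:
        target_recall = prf(record["target"], record["truth"])[1]
        reference_recall = prf(record["reference"], record["truth"])[1]
        gaps.append(target_recall - reference_recall)
    return gaps


def bootstrap_mean_std(values, fraction=0.1, repetitions=100, seed=0):
    """Mean and std of the sample mean over repeated draws with replacement.

    Each repetition draws `fraction` of the values with replacement,
    mirroring the "100 repetitions of sampling 10% of records" in the captions.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(values) * fraction))
    means = [
        statistics.fmean(rng.choices(values, k=sample_size))
        for _ in range(repetitions)
    ]
    return statistics.fmean(means), statistics.pstdev(means)


if __name__ == "__main__":
    # Toy records, invented purely to exercise the functions.
    toy_records = [
        {"truth": {"dog", "ball"}, "target": {"dog", "ball"}, "reference": {"dog"}},
        {"truth": {"cat", "sofa"}, "target": {"cat"}, "reference": {"cat"}},
    ] * 50
    mean_gap, std_gap = bootstrap_mean_std(recall_gaps(toy_records))
    print(f"recall gap: {mean_gap:.3f} +/- {std_gap:.3f}")
```

A positive gap means the target model recovers more of a training image's objects than a reference model that never saw that image, which is the signal the captions describe as déjà vu memorization.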

Theorems & Definitions (1)

  • Definition 1: Déjà vu Memorization