Extracting Training Data from Document-Based VQA Models

Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari

TL;DR

This work uncovers privacy risks in document-based Vision-Language Models by showing that training data can be memorized and partially extracted as answers to questions even when the corresponding visual evidence is removed. It introduces a formal framework to distinguish memorization from generalization using a generalization baseline and canaries, and defines Extractable Memorization ($\mathcal{M}_E$) and Extractable Simplicity ($\mathcal{S}_E$) scores to quantify extractability under partial context. Through controlled ablations (e.g., No Text in Image, paraphrasing, perturbations, modality permutation) across Donut, Pix2Struct, and PaLI-3 on DocVQA, the paper shows that memorization is modulated by training resolution and pretraining, with OCR-free models like PaLI-3 generally exhibiting less memorization at high resolutions. As a practical defense, Extraction Blocking (EB) nearly eliminates extractable data while maintaining or improving DocVQA performance, offering a feasible privacy-preserving direction, though challenges remain (e.g., potential side channels, the role of differential privacy).

Abstract

Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personally Identifiable Information (PII) repeated once in the training set, indicating these models could divulge memorized sensitive information and therefore pose a privacy risk. We quantitatively measure the extractability of information in controlled experiments and differentiate between cases where it arises from generalization capabilities or from memorization. We further investigate the factors that influence memorization across multiple state-of-the-art models and propose an effective heuristic countermeasure that empirically prevents the extractability of PII.
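
The extraction test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the image is a toy grayscale grid, the model is any callable taking an image and a question, and exact match after normalization stands in for whatever answer metric the paper uses. The names `remove_answer`, `is_extractable`, and `likely_memorized` are hypothetical, and the `baseline` callable is only a stand-in for the paper's generalization baseline.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]            # toy grayscale image, row-major
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), right/bottom exclusive
Model = Callable[[Image, str], str]  # hypothetical VQA interface: (image, question) -> answer


def remove_answer(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the answer region painted over."""
    left, top, right, bottom = box
    masked = [row[:] for row in image]  # copy rows so the original is untouched
    for y in range(max(top, 0), min(bottom, len(masked))):
        row = masked[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = fill
    return masked


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic matching rule)."""
    return " ".join(text.lower().split())


def is_extractable(model: Model, image: Image, box: Box,
                   question: str, answer: str) -> bool:
    """True if the model still returns the answer once its evidence is masked."""
    masked = remove_answer(image, box)
    return normalize(model(masked, question)) == normalize(answer)


def likely_memorized(target: Model, baseline: Model, image: Image, box: Box,
                     question: str, answer: str) -> bool:
    """Extractable from the target model but not from a baseline that never
    trained on this sample, so the answer is not explained by generalization."""
    return (is_extractable(target, image, box, question, answer)
            and not is_extractable(baseline, image, box, question, answer))
```

Comparing against a baseline reflects the distinction drawn above: an answer that a model can guess without ever having seen the sample is attributed to generalization, not memorization.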

Paper Structure (33 sections, 2 equations, 7 figures, 2 tables)

This paper contains 33 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: A malicious user may prompt a Vision-Language Model (VLM) to reveal secret information about a victim by generating a copy of the original document with the secret information missing (black box). If the secret was part of the training question-answer pairs, the VLM may respond correctly. For ethical reasons, we anonymize (grey boxes) personal information of a DocVQA sample on which the attack is successful for the Donut model. The answer is repeated only once in the whole training set, yet it is memorized.
  • Figure 2: Four examples of Personally Identifiable Information (PII) extractable by Donut (two leftmost samples) and Pix2Struct-Base (two rightmost samples). A malicious user may query the model to reveal the PII by using a scan of the document from which the PII has been removed (black boxes in the image). We anonymize personal information using grey boxes.
  • Figure 3: Extractability of answers for an attacker prompting the model with the original image from which the answer has been removed ($I_i^{-a_i}$) and the original training question $Q_i$. The Y-axis is in log scale and therefore overemphasizes the magnitude of lower values. PaLI-3 exhibits the lowest amount of extractable information in $M$.
  • Figure 4: Number of samples in $M$ that are PII, and number of samples that are unique PII, when querying the model with $(I^{-a}, Q)$.
  • Figure 5: Distributions of the $\hat{\mathcal{M}}_E$ and $\hat{\mathcal{S}}_E$ scores for all the canaries, $E-G$, and $G$, for both Pix2Struct-Base at 1M pixels (three panels on the left) and Donut at 2560 x 1920 (three panels on the right). Samples in $E-G$ have high memorization scores, while samples in $G$ do not.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 3.1