Revisiting Data Auditing in Large Vision-Language Models

Hongyu Zhu, Sichu Liang, Wenwen Wang, Boheng Li, Tongxin Yuan, Fangqi Li, ShiLin Wang, Zhuosheng Zhang

TL;DR

This work reveals that current membership inference benchmarks for large vision-language models are compromised by distribution shifts between member and non-member images, which can dominate separability without relying on memorization signals. It introduces WiRED, a principled, efficient metric based on sliced Wasserstein distances to quantify these shifts, and constructs unbiased i.i.d. MI benchmarks showing MI methods perform only marginally better than chance under realistic conditions. The authors probe the VLM embedding space to estimate Bayes-optimal limits and find a substantial irreducible error, underscoring fundamental challenges in auditing VLM training data. Nevertheless, they identify three practical auditing scenarios—fine-tuning with overfitting, access to ground-truth captions, and aggregation over image sets—where MI becomes feasible and practically valuable for detecting test-set contamination and copyright violations. The study lays a systematic groundwork for trustworthy data auditing in VLMs and guides future efforts toward robust, fair evaluation and auditing methods.

Abstract

With the surge of large language models (LLMs), Large Vision-Language Models (VLMs)--which integrate vision encoders with LLMs for accurate visual grounding--have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC > 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM's embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios--fine-tuning, access to ground-truth texts, and set-based inference--where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
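The optimal-transport idea behind the shift metric can be illustrated with a generic sliced Wasserstein estimate between two sets of image features. The following is a minimal sketch under stated assumptions, not the paper's WiRED metric: the function name `sliced_wasserstein`, the choice of the 1-Wasserstein distance, the number of random projections, and the equal-sample-size requirement are all illustrative choices made here.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same number of rows n
    (an assumption of this sketch, so sorted samples can be paired).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalize them to unit length.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both samples onto every direction: shape (n, n_projections).
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # In 1-D, W1 between equal-size empirical samples is the mean
    # absolute gap between sorted values; average over directions.
    return float(np.mean(np.abs(px - py)))

# Hypothetical features: two draws from one distribution, plus a shifted copy.
rng = np.random.default_rng(1)
members = rng.normal(size=(500, 16))
iid_non_members = rng.normal(size=(500, 16))
shifted_non_members = iid_non_members + 1.0

print("i.i.d. split :", sliced_wasserstein(members, iid_non_members))
print("shifted split:", sliced_wasserstein(members, shifted_non_members))
```

A larger value for the shifted split than for the i.i.d. split would indicate that member and non-member features can be separated by distribution alone, which is the kind of shortcut cue the abstract describes.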

Paper Structure

This paper contains 32 sections, 10 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Revisiting VLM MI: (1) Identifying Bias in MI Datasets, (2) Probing Bayes Optimality in VLM Inner States, (3) Future Scenarios for Data Auditing.
  • Figure 2: Distribution Shifts in Existing MI Datasets: (a) Flickr; (b) DALL-E.
  • Figure 3: Ablation Performance of Probing Methods (LLaVA-ov on COCO).
  • Figure 4: MI Performance in Aggregation-based Set Inference on LLaVA-ov.
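
The intuition for why aggregating over image sets can make auditing feasible, even when per-image MI is near chance, can be sketched with synthetic scores. Everything here is an assumption for illustration: the Gaussian score model, the 0.1-standard-deviation gap between members and non-members, the independence of scores, and the helper names `auc` and `set_scores`. None of it reproduces the paper's data or method.

```python
import numpy as np

def auc(pos, neg):
    """Rank-based AUC: probability that a random positive score exceeds
    a random negative one (assumes continuous scores, so no ties)."""
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks
    n_pos, n_neg = len(pos), len(neg)
    rank_sum = ranks[:n_pos].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def set_scores(scores, set_size):
    """Average per-sample scores over disjoint sets of `set_size` samples."""
    n_sets = len(scores) // set_size
    return scores[: n_sets * set_size].reshape(n_sets, set_size).mean(axis=1)

rng = np.random.default_rng(0)
# Synthetic per-image MI scores with a deliberately weak member signal.
member = rng.normal(0.1, 1.0, size=20000)
non_member = rng.normal(0.0, 1.0, size=20000)

for k in (1, 10, 100):
    value = auc(set_scores(member, k), set_scores(non_member, k))
    print(f"set size {k:>3}: AUC = {value:.3f}")
```

Under these assumptions, averaging k independent scores shrinks the noise by a factor of sqrt(k) while the mean gap stays fixed, so set-level AUC should rise with k. Real MI scores may be correlated within a set, which would weaken this effect.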