Contamination Detection for VLMs using Multi-Modal Semantic Perturbation

Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, Yong Jae Lee

TL;DR

This work tackles test-set leakage in Vision-Language Models by introducing a detection framework based on multi-modal semantic perturbation. The method generates semantically perturbed image–text pairs that preserve difficulty while altering the correct answer, enabling detection of memorization as a generalization failure. The authors formalize contamination, establish practicality, reliability, and consistency as detection criteria, and demonstrate robust detection across different models, fine-tuning regimes, and benchmarks (MMStar and RealWorldQA). They also validate robustness through ablations, real-world counterfactuals, and larger-scale models, highlighting practical implications for decontaminating and evaluating VLMs in real-world settings.

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior work has proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel, simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset will be released publicly.
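The detection idea in the abstract, that a contaminated model fails to generalize from an original benchmark item to its perturbed counterpart, can be illustrated with a minimal sketch. This is not the authors' released code: the names `PairedResult`, `accuracy_drop`, and `looks_contaminated`, as well as the `threshold` parameter and its default, are hypothetical choices made for illustration only. The sketch assumes each benchmark item has already been scored on both its original and perturbed versions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairedResult:
    """Outcome for one benchmark item and its perturbed counterpart."""
    original_correct: bool
    perturbed_correct: bool


def accuracy_drop(results: Sequence[PairedResult]) -> float:
    """Return accuracy on originals minus accuracy on perturbed variants."""
    if not results:
        raise ValueError("need at least one paired result")
    n = len(results)
    acc_original = sum(r.original_correct for r in results) / n
    acc_perturbed = sum(r.perturbed_correct for r in results) / n
    return acc_original - acc_perturbed


def looks_contaminated(results: Sequence[PairedResult],
                       threshold: float = 0.0) -> bool:
    """Flag a model whose accuracy falls on perturbed items.

    `threshold` is a hypothetical knob, not a value taken from the paper.
    """
    return accuracy_drop(results) > threshold


# Toy data: a model that solves every original item but only half of the
# perturbed ones, versus a model that does no worse after perturbation.
memorizing = [PairedResult(True, True), PairedResult(True, False),
              PairedResult(True, True), PairedResult(True, False)]
generalizing = [PairedResult(True, True), PairedResult(False, True),
                PairedResult(True, True), PairedResult(False, False)]
```

Under this sketch, a positive accuracy drop is read as a sign of memorization, since the perturbed items are meant to be of comparable difficulty to the originals. A real evaluation would also need to account for sampling noise before drawing a conclusion.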

Paper Structure

This paper contains 22 sections, 10 figures, 22 tables.

Figures (10)

  • Figure 1: Example of our multi-modal semantic perturbation pipeline applied to the RealWorldQA benchmark. Using ControlNet trained with Flux models, a new speed limit sign is generated, changing the correct answer from (B) to (C) while preserving the original image's overall composition. A contaminated model that has memorized the original question is likely to fail on the perturbed version.
  • Figure 2: Illustration of our multi-modal semantic perturbation pipeline. The original question–image pair is used to generate a dense caption with an LLM, which guides Flux ControlNet to produce a perturbed image and new answer, yielding a modified but semantically consistent benchmark sample.
  • Figure 3: Example where the perturbed variant is easier to solve than the original. In the original image, the traffic sign is small and the text barely legible; after perturbation, the sign is enlarged and clearly visible.
  • Figure 4: Example where a contaminated model answers both the original and perturbed questions correctly. This may occur when visual details change so significantly that the perturbed image no longer closely resembles the original.
  • Figure : (a) Original
  • ...and 5 more figures