Clean Evaluations on Contaminated Visual Language Models

Hongyuan Lu, Shujie Miao, Wai Lam

TL;DR

The paper addresses data contamination in visual-language model (VLM) evaluation and proposes a visual clean-evaluation pipeline that operates on image inputs. The setup compares a baseline model $M_1$ with a contaminated model $M_2$ on transformed inputs $z' = t(z)$, where predictions are given by $P(y|x,z') = M(x,z')$. A new uncontaminated dataset of 1,000 images and 2,561 QA rounds collected from Gamersky enables reliable benchmarking. The authors show that traditional image augmentations can inflate contaminated-model scores, while the proposed BGR channel swapping robustly reduces contamination effects and resists manipulation during training. Overall, the work provides a practical framework for transparent VLM evaluation and points toward broader clean-evaluation methods across modalities.

Abstract

How to evaluate large language models (LLMs) cleanly has been established as an important research area, with the aim of genuinely reporting the performance of possibly contaminated LLMs. Yet how to cleanly evaluate visual language models (VLMs) is an under-studied problem. We propose a novel approach to achieve this goal through data augmentation methods applied to the visual input. We then craft a new visual clean-evaluation benchmark with thousands of data instances. Through extensive experiments, we found that traditional visual data augmentation methods are useful, but they are at risk of being used as part of the training data as a workaround. We further propose using BGR augmentation to switch the colour channels of the visual input. We found that it is a simple yet effective method for reducing the effect of data contamination and, fortunately, it is also harmful when used as a data augmentation method during training. This means that it is hard for malicious trainers to integrate such augmentation into training, and it could be a promising technique for cleanly evaluating visual LLMs. Our code, data, and model weights will be released upon publication.
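To make the evaluation setup concrete, the following is a minimal sketch of a channel-swapping transform $t$ and an accuracy loop over (question, image, answer) triples. It is an illustration under stated assumptions, not the authors' released code: images are represented as nested lists of RGB tuples, and the names `bgr_swap`, `accuracy`, and `memorizing_model` are hypothetical. The memorizing model is a toy stand-in for a contaminated model that has seen the exact test images.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of (R, G, B) pixels; an assumed toy format
Example = Tuple[str, Image, str]   # (question x, image z, reference answer y)


def bgr_swap(image: Image) -> Image:
    """Transform t(z): reverse the channel order of every pixel (RGB -> BGR)."""
    return [[(b, g, r) for (r, g, b) in row] for row in image]


def accuracy(
    model: Callable[[str, Image], str],
    examples: Sequence[Example],
    transform: Optional[Callable[[Image], Image]] = None,
) -> float:
    """Exact-match accuracy of model(x, z'), where z' = transform(z) if given."""
    correct = 0
    for question, image, answer in examples:
        z = transform(image) if transform is not None else image
        if model(question, z) == answer:
            correct += 1
    return correct / len(examples)


def memorizing_model(memory: Dict[tuple, str]) -> Callable[[str, Image], str]:
    """Hypothetical 'contaminated' model: answers only inputs it has memorized."""
    def model(question: str, image: Image) -> str:
        key = (question, tuple(map(tuple, image)))
        return memory.get(key, "unknown")
    return model


if __name__ == "__main__":
    img = [[(255, 0, 0), (0, 255, 0)]]
    data = [("How many pixels are in the row?", img, "2")]
    leaked = {(q, tuple(map(tuple, z))): a for q, z, a in data}
    model = memorizing_model(leaked)
    print("original images:", accuracy(model, data))
    print("BGR-swapped images:", accuracy(model, data, transform=bgr_swap))
```

With images stored as H x W x 3 NumPy arrays, the same swap is usually written as `image[..., ::-1]`. Comparing the score gap between original and transformed inputs for $M_1$ and $M_2$ is one way to read the comparison described in the TL;DR; the exact metric used in the paper is not specified in this summary.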

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Two examples from the dataset. Pairs of images with corresponding questions and answers.
  • Figure 2: Distribution of the game genres in our collected dataset. 'Elden Ring' (605 instances), 'Anime Games' (227 instances, including Genshin Impact and Honkai: Star Rail), 'Other RPG Games' (24 instances, with titles like Dungeon & Fighter), 'Shooting Games' (122 instances, featuring GTA V, Valorant, and Delta Force), and 'Others' (22 instances, including Palworld and League of Legends).