NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan

TL;DR

NaturalBench tackles the gap between perceived progress in vision-language models and their ability to handle natural, adversarial samples. It introduces a vision-centric benchmark that pairs each QA with two images whose answers differ, generated via a semi-automated pipeline and verified by humans, resulting in 10k multilingual samples with fine-grained skill tagging. Across 53 VLMs, most models lag far behind human performance, revealing substantial compositional and bias-related challenges; the authors also develop debiasing-focused evaluation via VQAScore and demonstrate dynamic extension to new data sources. The work offers a pathway for dynamic, grounded evaluation and bias mitigation, providing a valuable resource to guide next-generation VLM development and evaluation.

Abstract

Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.
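
The effect of the paired design on a "blind" solution can be illustrated with a small sketch. It is not the authors' evaluation code: the sample layout, the example questions, and the names `GOLD`, `QUESTIONS`, `blind_model`, and `count_correct_per_question` are assumptions made for illustration. Because a blind model never sees the image, it must give the same answer for both images of a question, and since the two ground-truth answers differ, it can be right on at most one of them.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sample layout (an assumption, not the dataset's real schema):
# ground-truth answers keyed by (image_index, question_index).
# For each question, the two images yield different answers.
GOLD: Dict[Tuple[int, int], str] = {
    (0, 0): "yes", (1, 0): "no",   # question 0
    (0, 1): "no",  (1, 1): "yes",  # question 1
}

# Illustrative questions only; they are not taken from NaturalBench.
QUESTIONS: List[str] = ["Is the dog running?", "Is the dog lying down?"]


def blind_model(question: str) -> str:
    """A stand-in for a model that answers without looking at any image."""
    return "yes"


def count_correct_per_question(
    model: Callable[[str], str],
    questions: List[str],
    gold: Dict[Tuple[int, int], str],
) -> List[int]:
    """For each question, count on how many of the two images the model is right."""
    counts = []
    for q_idx, question in enumerate(questions):
        answer = model(question)  # identical for both images by construction
        counts.append(sum(answer == gold[(img, q_idx)] for img in (0, 1)))
    return counts


per_question = count_correct_per_question(blind_model, QUESTIONS, GOLD)
print(per_question)  # each entry is at most 1 out of 2
```

Under these assumptions, no choice of constant answer lets the blind model get both images of a question right, which is the property the paired design relies on.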

Paper Structure

This paper contains 12 sections, 5 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: NaturalBench examples consist of two questions and two images with alternating answers to prevent "blind" models from scoring well (e.g., those that predict the same answer regardless of the image or question, as discussed in the data collection section). We compare the ground-truth answer for each (image, question) pair with predictions from leading VLMs including GPT-4o (gpt-4o-2024-08-06), Qwen2-VL (72B), Llama3.2-Vision (90B), and Molmo (72B) (see the results section). Even the best models like GPT-4o lag far behind human performance (which is above 90%). Figure 2 shows the pipeline for collecting these natural adversarial examples.
  • Figure 2: Collecting NaturalBench. We use a semi-automated procedure to collect NaturalBench from natural image-text corpora like Flickr30K [flickr30k]. First, we identify confounding pairs of image-text samples that fail discriminative VLMs like CLIP [clip] and BLIP-2 [blipv2], e.g., they wrongly match an image with another image's caption. Next, we prompt ChatGPT to design questions that yield different answers for each image, providing the original captions in the prompt. The data collection section details this procedure. We hire human annotators to filter out incorrect VQA samples, such as "Is the motorcyclist wearing a red and white uniform?", which has an identical answer of "Yes" for both images. Unlike previous adversarial benchmarks [adversarialvqa, humanadversarialvqa, naturaladv, goodfellow2014explaining], NaturalBench neither targets any specific VQA model nor perturbs the images or questions. The dynamic evaluation section extends this simple procedure to diverse data sources (e.g., non-English) to highlight its potential for future dynamic evaluations [dynabench] of VLMs.
  • Figure 3: Example questions in previous benchmarks solvable by commonsense knowledge. We provide example questions from existing VQA benchmarks that can be addressed using commonsense knowledge. For these questions, a "blind" language model, such as ChatGPT (without vision input), can already answer them without looking at the image.
  • Figure 4: Performance of GPT-3.5 vs. LLaVA-1.5 on previous VQA benchmarks. We split each benchmark into equal-sized training and test sets, and report zero-shot (in blue) and finetuned (in green) results. Previous benchmarks show strong language biases, allowing blind GPT-3.5 to exploit spurious answer patterns (see the results section) by finetuning on QA data without images. As a result, blind GPT-3.5 greatly surpasses random chance performance (see the red dotted line) and sometimes even matches the performance of LLaVA-1.5-7B finetuned using images. In contrast, Figure 5 shows that NaturalBench can effectively prevent blind solutions from exceeding chance.
  • Figure 5: Performance of GPT-3.5, LLaVA-1.5, and GPT-4o on NaturalBench. We also split NaturalBench (the English subset) into equal-sized training and test sets, and report zero-shot (in blue) and finetuned (in green) results. We report group accuracy (G-Acc) (introduced in the results section), which awards a point when all four (image, question) pairs are answered correctly. We highlight key results: (1) Blind GPT-3.5 fails to surpass random chance performance (red dotted line), regardless of finetuning. (2) LLaVA-1.5 improves by 9% by finetuning on NaturalBench's training images. (3) Even GPT-4o gains 10% G-Acc through vision finetuning on NaturalBench; however, it still falls far behind human performance (purple dotted line). These findings confirm that NaturalBench is a more vision-centric benchmark, and a potentially useful dataset for improving already advanced VLMs.
  • ...and 3 more figures
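
The group accuracy (G-Acc) described in the Figure 5 caption awards a point only when all four (image, question) pairs of a sample are answered correctly. The following is a minimal sketch of that rule, not the authors' released scoring code: the dictionary layout keyed by (image_index, question_index) and the function name `group_accuracy` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout (an assumption): one dict per sample, mapping
# (image_index, question_index) -> answer string, with four entries per sample.
Answers = Dict[Tuple[int, int], str]


def group_accuracy(preds: List[Answers], golds: List[Answers]) -> float:
    """Fraction of samples for which every (image, question) pair is correct."""
    if not golds:
        return 0.0
    hits = 0
    for pred, gold in zip(preds, golds):
        # A sample scores only if all of its gold answers are matched.
        if all(pred.get(key) == answer for key, answer in gold.items()):
            hits += 1
    return hits / len(golds)


gold = {(0, 0): "yes", (1, 0): "no", (0, 1): "no", (1, 1): "yes"}

perfect = dict(gold)       # all four pairs correct
one_miss = dict(gold)
one_miss[(1, 1)] = "no"    # three of four correct, so the sample earns no point

score = group_accuracy([perfect, one_miss], [gold, gold])
print(score)
```

In this sketch a prediction with three of four pairs correct contributes nothing, which is what makes the metric stricter than per-question accuracy.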