IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

TL;DR

IllusionVQA introduces a challenging optical-illusion VQA dataset to probe Vision Language Models on both comprehension and soft localization tasks. The authors curate 374 high-quality illusion images across 12 categories and generate 435 comprehension questions plus 1000 soft-localization instances, enabling a rigorous, multi-axis evaluation. Across experiments with GPT4V, Gemini-Pro, and open-source VLMs, results show large models outperform small ones but still fall far short of humans, with in-context learning and chain-of-thought sometimes impairing performance, especially on localization. The work highlights critical gaps in VLM reasoning under perceptual ambiguity and provides a dataset and baseline to push toward illusion-resilient multimodal systems for real-world robotics and AI.

Abstract

The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro in the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
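
Both tasks above are multiple-choice, so the reported percentages are plain accuracy: the fraction of questions for which the chosen option matches the answer key. The following is a minimal sketch of that scoring, assuming a hypothetical `MCQuestion` record and predictions given as option indices. Neither the schema nor the toy items come from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass
class MCQuestion:
    """One multiple-choice VQA item (hypothetical schema, not the dataset's own)."""
    image_id: str
    question: str
    options: list[str]
    answer_index: int  # index of the correct option within `options`


def accuracy(questions: list[MCQuestion], predictions: list[int]) -> float:
    """Return the fraction of questions whose predicted index matches the key."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        raise ValueError("cannot score an empty question set")
    correct = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return correct / len(questions)


# Toy items for illustration only; these are NOT taken from IllusionVQA.
toy = [
    MCQuestion("img0", "Which line is longer?", ["Top", "Bottom", "Same length"], 2),
    MCQuestion("img1", "Is the staircase physically possible?", ["No", "Yes"], 0),
    MCQuestion("img2", "Which square is darker?", ["A", "Same shade", "B"], 1),
    MCQuestion("img3", "Which object is closer?", ["Left", "Right", "Both", "Cannot tell"], 3),
]

# Hypothetical model outputs: the third prediction is wrong.
toy_predictions = [2, 0, 0, 3]
score = accuracy(toy, toy_predictions)
print(f"accuracy = {score:.2%}")
```

Scoring by option index rather than by option text avoids string-matching issues when a model paraphrases an answer, though it assumes the model's reply has already been mapped to one of the listed options.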

Paper Structure (42 sections, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Left: Comparison of human and VLM performance in IllusionVQA-Comprehension. Right: Comparison of IllusionVQA with prior illusion datasets: GVIL (Zhang et al., 2023) and HallusionBench (Liu et al., 2023).
  • Figure 2: Examples of optical illusions in IllusionVQA-Comprehension
  • Figure 3: Categories in IllusionVQA-Comprehension. Refer to the appendix on comprehension categories for details.
  • Figure 4: Examples demonstrating the task of Soft-Localization in IllusionVQA.
  • Figure 5: Venn Diagrams showing the agreement between prompting techniques. There are instances where ICL and CoT cause the VLMs to answer incorrectly.
  • ...and 5 more figures