Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models

Arshia Hemmat, Adam Davies, Tom A. Lamb, Jianhao Yuan, Philip Torr, Ashkan Khakzar, Francesco Pinto

Abstract

Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: https://arshiahemmat.github.io/illusionbench/

Paper Structure

This paper contains 70 sections, 1 equation, 20 figures, 1 table.

Figures (20)

  • Figure 1: Can vision-language models recognize these shapes? The IllusionBench dataset contains images in which scene elements are arranged to represent abstract shapes.
  • Figure 2: Dataset generation. For each of the 3 datasets in IllusionBench, we show an example image from the dataset alongside an example scene prompt and an example shape conditioning image used to generate it. A shape image $x_{i}$ (with the class name $c_{i}$) and a scene description $s_{j}$ are combined to generate the IllusionBench image $x_{ij}$.
  • Figure 3: Zero-Shot Results. Average shape and scene recall of VLMs across each IllusionBench dataset, compared with Stylized-ImageNet (Geirhos et al., 2018) (rightmost, shaded).
  • Figure 4: ICL Learning Tasks. The four ICL tasks (ICL1, ICL2, ICL3, and ICL4), defined by constraints on demonstration example selection, as introduced in the ICL experiments section.
  • Figure 5: ICL Results. Few-shot (0-, 1-, 2-, and 4-shot) shape and scene recall of VLMs averaged across the IllusionBench-LOGO, IllusionBench-IN, and IllusionBench-ICON datasets, shown for each ICL task and each prediction task.
  • ...and 15 more figures