Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon, Eliya Habba, Royi Rassin, Yonatan Bitton, Idan Szpektor, Amir Globerson, Yuval Elovici

TL;DR

Visual Riddles is a benchmark that tests vision and language models on visual riddles requiring commonsense and world knowledge. Human evaluation shows that existing models lag significantly behind human performance, with Gemini-Pro-1.5 leading at 40% accuracy.

Abstract

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed at testing vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image (created by a variety of text-to-image models), a question, a ground-truth answer, a textual hint, and an attribution. Human evaluation reveals that existing models lag significantly behind human performance, which is at 82% accuracy, with Gemini-Pro-1.5 leading at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.
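
As a rough illustration of the item structure described in the abstract, the sketch below models one riddle as a record with the five listed components and computes accuracy from per-answer correctness judgments. The class name, field names, file path, and example values are all hypothetical and are not taken from the released dataset; the riddle text simply paraphrases the mosquito example above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VisualRiddle:
    """One benchmark item. Field names are assumptions, not the released schema."""
    image_path: str            # path to the generated image
    question: str              # open-ended question about the image
    ground_truth_answer: str   # reference answer
    hint: str                  # textual hint pointing at the visual clue
    attribution: str           # source supporting the answer


def accuracy(judgments: Sequence[bool]) -> float:
    """Fraction of answers judged correct (e.g. by human annotators)."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


# Hypothetical item, paraphrasing the mosquito example from the abstract.
riddle = VisualRiddle(
    image_path="riddle_001.png",  # placeholder path
    question="Why is the person scratching their arm?",
    ground_truth_answer="A mosquito nearby has likely bitten them.",
    hint="Look at what is flying near the arm.",
    attribution="(source URL would go here)",
)

# Made-up judgments for five answers: two of the five are marked correct.
score = accuracy([True, False, True, False, False])
```

Scoring over booleans keeps the accuracy computation independent of who produces the judgments, whether human annotators or an automatic evaluator.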

Paper Structure (45 sections, 14 figures, 13 tables)

Figures (14)

  • Figure 1: Introducing Visual Riddles, designed to test models on their ability to use commonsense, world knowledge, hints, attributions, and factuality in interpreting complex visual cues. This resource aims to enhance models' capability to handle nuanced and factual visual scenarios.
  • Figure 2: Overview of the Visual Riddles tasks: (1) Main Task: Solve open-ended questions. (2) Utilizing Hints: Use textual aids to identify key visual clues in riddles. (3) Employing Attributions: Apply web-sourced attributions to improve world knowledge. (4) Multiple Choice: Select the correct answer to the riddle from five options. (5) Automatic Evaluation: Evaluate open-ended answers in two scenarios: Reference-Free, which assesses the correctness of a candidate answer (CA) based only on the visual riddle, and Reference-Based, which compares CAs to the ground-truth answer (GTA).
  • Figure 3: Amazon Mechanical Turk interface for selecting answers to open-ended riddles. Annotators are presented with an image, a question and several candidate answers, including both human-curated and model-generated predictions, and are tasked with identifying the correct responses.
  • Figure 4: Comparison of model-generated and human-generated captions that were used in the Caption$\rightarrow$LLM setup. 'X' marks captions where critical details are missing in the model-generated version, while 'V' marks captions where these details are present.
  • Figure 5: Modified images ablation study: a demonstration of the process where the model evaluates an answer's validity using two scenarios: one with the original image and another with a modified image that alters the visual clue, affecting the correctness of the original ground truth answer.
  • ...and 9 more figures