Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

Aditya Chinchure, Sahithya Ravi, Raymond Ng, Vered Shwartz, Boyang Li, Leonid Sigal

TL;DR

BlackSwanSuite is a benchmark for abductive and defeasible video reasoning about unpredictable events. It is structured around three tasks, Forecaster, Detective, and Reporter, each of which controls how much visual information the model can access. The dataset combines 1,655 short videos from the Oops! dataset with 15,469 questions in generative, MCQ, and Y/N formats, enabling evaluation of perception, comprehension, and reasoning. Humans consistently outperform both open- and closed-source VLMs on the abductive and defeasible tasks, revealing substantial gaps in current architectures and training. The work highlights the need for improved perception and reasoning, and potentially novel training regimes, to enable robust defeasible video understanding, with implications for safe autonomous decision-making.

Abstract

The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no questions, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies. Our data and leaderboard are available at blackswan.cs.ubc.ca.

Paper Structure

This paper contains 60 sections, 27 figures, 16 tables, 1 algorithm.

Figures (27)

  • Figure 1: BlackSwanSuite. Our benchmark involves three tasks: i) Forecaster evaluates a model's ability to hypothesize future events. ii) Detective involves abductive reasoning by explaining the hidden event, and defeasible reasoning by validating existing hypotheses. iii) Reporter again tests defeasibility and the model's ability to describe the unexpected event.
  • Figure 2: BlackSwanSuite contains 1,655 videos from a variety of topics, as depicted above.
  • Figure 3: Qualitative results on MCQ and Y/N variants. In the video, a man swings a pillow at the Christmas tree, causing ornaments to fly towards the lady. Examples (a), (b), (c) and (d) are task questions from our dataset.
  • Figure 4: Data Collection Process. We start by filtering for videos that meet our dataset requirements, so that each can be split into $V_{pre}$, $V_{main}$ and $V_{post}$. Using 10% of the data, we collect annotations to select the best annotators. With these annotators, we collect the full dataset and report dataset quality.
  • Figure 5: Length of Videos. The median video length is 8.83 seconds. Only a small number of videos are outliers, with 29 of them being longer than 25 seconds.
  • ...and 22 more figures
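
The three-way split named in the Figure 4 caption, and the way each task restricts visual access, can be illustrated with a minimal sketch. Everything below is hypothetical: the `Segment` type, the `split_video` and `visible_segments` helpers, and the example timestamps are illustrative names and values, not the authors' code. In particular, the task-to-segment mapping (Forecaster sees only $V_{pre}$, Detective sees $V_{pre}$ and $V_{post}$, Reporter sees the full video) is an assumption inferred from the task descriptions, not something stated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time span within a video, in seconds (hypothetical helper type)."""
    start: float
    end: float


def split_video(duration: float, event_start: float, event_end: float) -> dict:
    """Split a video into pre, main, and post segments around the unexpected event."""
    if not 0.0 <= event_start < event_end <= duration:
        raise ValueError("expected 0 <= event_start < event_end <= duration")
    return {
        "pre": Segment(0.0, event_start),
        "main": Segment(event_start, event_end),
        "post": Segment(event_end, duration),
    }


# ASSUMPTION: which segments each task exposes to the model.
VISIBLE = {
    "forecaster": ("pre",),               # predict what happens next
    "detective": ("pre", "post"),         # explain the hidden main event
    "reporter": ("pre", "main", "post"),  # describe the event with full access
}


def visible_segments(task: str, segments: dict) -> list:
    """Return the segments a model is allowed to see for the given task."""
    return [segments[name] for name in VISIBLE[task]]


# Example with made-up timestamps for a video of the median length (8.83 s).
segments = split_video(duration=8.83, event_start=3.0, event_end=5.0)
for task in VISIBLE:
    print(task, visible_segments(task, segments))
```

Keeping the visibility mapping in a single table makes the information restriction explicit and easy to audit, which matters because the tasks differ only in what the model is shown.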