Impossible Videos

Zechen Bai, Hai Ci, Mike Zheng Shou

TL;DR

Impossible videos are proposed as a new testbed to push video generation and understanding beyond real-world data. The authors introduce IPV-Bench, consisting of a four-domain taxonomy, the IPV-Txt prompt suite, and the IPV-Vid video dataset, to rigorously evaluate both generation and understanding under counterfactual scenarios. Through extensive experiments, they reveal that state-of-the-art models struggle with impossible content, especially in temporal reasoning, and identify key directions for improving temporal modules and world-knowledge integration. The work provides a public benchmark framework to drive next-generation video models capable of reasoning about anti-reality scenarios.

Abstract

Synthetic videos are now widely used to compensate for the scarcity and limited diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual, and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today's video generation models effectively follow prompts to create impossible video content? 2) Are today's video understanding models good enough to understand impossible videos? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt-following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning about temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and offer insights for future directions, paving the way for next-generation video models.

Paper Structure

This paper contains 26 sections, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Impossible Video Examples with Impossible Type and Explanation.
  • Figure 2: Overview of the IPV-Bench Benchmark. IPV-Bench is structured with a comprehensive taxonomy, enabling the creation of a diverse prompt suite (IPV-Txt) and a high-quality video dataset (IPV-Vid). These components facilitate the evaluation of popular video generation and understanding models.
  • Figure 3: Questionnaire used for collecting impossible text prompts for IPV-Txt.
  • Figure 4: Distribution of the Prompt Suite Across the Taxonomy.
  • Figure 5: Sources of Impossible Videos.
  • ...and 15 more figures