Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate

TL;DR

This paper introduces Defeasible Video Entailment (DVidE), a task that requires VLMMs to revise video–hypothesis entailment as new evidence arrives, capturing defeasible reasoning in multimodal, temporal contexts. It proposes a Chain of Counterfactual Thought (CoCT) framework for classification, augmented by ASR-enhanced video content integration and rationale refinement via LLMs, plus an LLM-guided ASR-integrated approach for generation. The authors also present the DVidE benchmark, built atop VIOLIN with strengthener/weakener annotations and an LLM-based evaluation metric for generation, reporting substantial gains over baselines in both tasks (e.g., ~29.7% accuracy improvement on classification; generation updates rising from ~7.5% to ~60.6%). Collectively, these contributions advance dynamic, multimodal reasoning in VLMMs and provide datasets and evaluation tools to study defeasible video inference in practical settings.

Abstract

Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning: the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must either determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). To solve the classification task, we propose the Chain of Counterfactual Thought framework, which uses counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.
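To make the classification version of the task concrete, the following is a minimal sketch of how one example and its accuracy scoring might be represented. It is not the authors' code or the benchmark's actual schema: the class `DVidEExample`, its field names, the function `classification_accuracy`, and the sample sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Literal

# The two update types named in the abstract.
Label = Literal["strengthener", "weakener"]


@dataclass(frozen=True)
class DVidEExample:
    """One classification example (hypothetical schema, not the benchmark's)."""
    video_id: str    # assumed identifier for the video premise
    hypothesis: str  # textual hypothesis about the video
    update: str      # new information to be judged
    label: Label     # whether the update strengthens or weakens the hypothesis


def classification_accuracy(
    examples: Iterable[DVidEExample],
    predict: Callable[[DVidEExample], Label],
) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)


# Made-up examples for illustration only; they are not taken from the dataset.
toy_examples = [
    DVidEExample("clip_0", "The man is angry at the woman.",
                 "He laughs and hugs her afterwards.", "weakener"),
    DVidEExample("clip_0", "The man is angry at the woman.",
                 "He slams the door as he leaves.", "strengthener"),
]


def always_weakener(example: DVidEExample) -> Label:
    """Trivial baseline that ignores its input."""
    return "weakener"


toy_accuracy = classification_accuracy(toy_examples, always_weakener)
```

A real system would replace `always_weakener` with a call to a VLMM that also receives the video; the sketch only fixes the shape of the inputs and of the metric.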

Paper Structure

This paper contains 34 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: A defeasibility example in video entailment.
  • Figure 2: The architecture of the Chain of Counterfactual Thought Classification Framework, comprising three modules: Counterfactual Thought-Induced Rationale Generation, ASR-Enhanced Video Content Integration, and Rationale Refinement and Selection.
  • Figure 3: The architecture of the LLM-Guided ASR-Integrated Generation Framework, comprising ASR-Enhanced Video Content Integration and LLM-Refined Update Generation.
  • Figure 4: Prompt used for the Classification Task across VLMMs.
  • Figure 5: Prompt used for baselines in the Generation Task.
  • ...and 5 more figures