Can Video Large Multimodal Models Think Like Doubters or Double-Down: A Study on Defeasible Video Entailment
Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
TL;DR
This paper introduces Defeasible Video Entailment (DVidE), a task that requires Video Large Multimodal Models (VLMMs) to revise video–hypothesis entailment as new evidence arrives, capturing defeasible reasoning in multimodal, temporal contexts. For classification, it proposes a Chain of Counterfactual Thought (CoCT) framework that combines counterfactual reasoning with video content enhanced by automatic speech recognition (ASR) and LLM-based rationale refinement; for generation, it proposes an LLM-guided, ASR-integrated approach. The authors also present the DVidE benchmark, built atop VIOLIN with strengthener/weakener annotations, together with an LLM-based evaluation metric for generation. They report substantial gains over baselines on both tasks (e.g., ~29.7% accuracy improvement on classification; generation updates rising from ~7.5% to ~60.6%). Collectively, these contributions advance dynamic, multimodal reasoning in VLMMs and provide datasets and evaluation tools for studying defeasible video inference in practical settings.
Abstract
Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning, that is, the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must either determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). To solve the classification task, we propose the Chain of Counterfactual Thought framework, which uses counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goal. Additionally, we introduce a novel benchmark dataset with strengthener/weakener annotations, along with an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.
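To make the classification version of the task concrete, the following is a minimal sketch of how one DVidE instance and an accuracy computation might be represented. It is an illustration under stated assumptions, not the authors' implementation: the names `DVidEExample`, `UpdateLabel`, `classification_accuracy`, and `always_weakener`, the field layout, and the toy examples are all hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class UpdateLabel(Enum):
    """The two roles an update can play relative to the hypothesis."""
    STRENGTHENER = "strengthener"
    WEAKENER = "weakener"


@dataclass(frozen=True)
class DVidEExample:
    """One classification instance (hypothetical layout, not the dataset schema)."""
    video_path: str      # assumed: location of the video premise
    hypothesis: str      # textual hypothesis about the video
    update: str          # new piece of evidence to be judged
    label: UpdateLabel   # gold annotation: strengthener or weakener


# A classifier maps an example to a predicted label. A real system would
# inspect the video; here the type only fixes the interface.
Classifier = Callable[[DVidEExample], UpdateLabel]


def classification_accuracy(
    examples: Sequence[DVidEExample], classify: Classifier
) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(1 for ex in examples if classify(ex) is ex.label)
    return correct / len(examples)


def always_weakener(example: DVidEExample) -> UpdateLabel:
    """Trivial baseline that ignores its input (for illustration only)."""
    return UpdateLabel.WEAKENER


# Invented toy examples, not drawn from the DVidE benchmark.
toy_examples = [
    DVidEExample(
        video_path="clip_001.mp4",
        hypothesis="The two people are arguing.",
        update="They are laughing and hugging a moment later.",
        label=UpdateLabel.WEAKENER,
    ),
    DVidEExample(
        video_path="clip_001.mp4",
        hypothesis="The two people are arguing.",
        update="One of them slams the door and leaves.",
        label=UpdateLabel.STRENGTHENER,
    ),
]

if __name__ == "__main__":
    print(classification_accuracy(toy_examples, always_weakener))
```

Pairing the same premise and hypothesis with one strengthener and one weakener, as in the toy data, means a model that ignores the update cannot score above chance on that pair.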
