Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Huabin Liu, Filip Ilievski, Cees G. M. Snoek

TL;DR

This work introduces a novel video-grounded entailment tree framework for commonsense VQA that explicitly grounds each reasoning step to video fragments. By transforming each answer option into declarative statements and recursively decomposing them into verifiable sub-statements, the method constructs an intelligible reasoning tree whose nodes are checked against localized video evidence via a video-language verifier. A dynamic expansion mechanism prunes unhelpful decompositions to improve efficiency, while a de-biasing procedure based on LLM rewriting supports fairer benchmarking by mitigating textual shortcuts. Experiments across multiple datasets and VLMs show consistent gains and competitive performance with far fewer parameters, with especially strong improvements on temporal and causal questions, and robust behavior on de-biased sets. The approach also delivers interpretable reasoning traces, enabling verification of the model’s decision paths and grounding quality.

Abstract

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.

Paper Structure (18 sections, 4 equations, 14 figures, 15 tables)

Figures (14)

  • Figure 1: Given a video question answering task, our framework performs explicit reasoning over an entailment tree, where answer options are transformed into statements. These statements are then recursively decomposed and verified based on video-grounded evidence relevant to the question.
  • Figure 2: Overview of our framework. (a) The generation of the entailment tree, where statements are recursively decomposed until the tree reaches its max depth or meets the stop criterion. (b) The process of video-language entailment verification: the input video is first converted into textual descriptions. Each caption is then parsed into structured semantics. Given the fact statement as a query, we retrieve the anchor frame. Then, based on the temporal or causal navigation indicated by questions, the visual evidence moment can be grounded.
  • Figure 3: Illustration of dynamic tree generation and backtrace. In Step-3, when the proof score of the left statement calculated from its child nodes is less than its direct score ($0.63<0.8$), its decomposition is pruned and stops.
  • Figure 4: Illustration of commonsense bias in video question answering. The example is selected from the NExT-QA dataset.
  • Figure 5: Prompt used for rewriting answers on NExT-QA.
  • ...and 9 more figures
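
The pruning rule described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `resolve` function, and the use of a product to combine child scores are all hypothetical. The caption only specifies that a decomposition is pruned when the proof score computed from the child nodes falls below the statement's direct score (0.63 < 0.8 in the figure).

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Node:
    """One statement in the entailment tree (hypothetical structure)."""
    statement: str
    direct_score: float  # verifier score for the statement itself
    children: List["Node"] = field(default_factory=list)


def resolve(node: Node) -> float:
    """Return the score kept for `node`, pruning unhelpful decompositions.

    Assumption: child scores are combined by a product. The source does
    not state the aggregation function in this excerpt.
    """
    if not node.children:
        return node.direct_score
    proof_score = prod(resolve(child) for child in node.children)
    if proof_score < node.direct_score:
        # The decomposition did not help: drop it and keep the direct score.
        node.children = []
        return node.direct_score
    return proof_score


# Mirrors the numbers in the Figure 3 caption: 0.9 * 0.7 = 0.63 < 0.8.
left = Node("left statement", 0.8, [Node("sub-statement A", 0.9),
                                    Node("sub-statement B", 0.7)])
kept = resolve(left)
```

With these inputs the children's combined score is below the direct score, so `resolve` removes the children and keeps the direct score of 0.8. A different aggregation (for example a minimum or a mean) would change the derived score but not the comparison-and-prune logic.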