Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Huabin Liu, Filip Ilievski, Cees G. M. Snoek
TL;DR
This work introduces a novel video-grounded entailment tree framework for commonsense VQA that explicitly grounds each reasoning step to video fragments. By transforming each answer option into declarative statements and recursively decomposing them into verifiable sub-statements, the method constructs an intelligible reasoning tree whose nodes are checked against localized video evidence via a video-language verifier. A dynamic expansion mechanism prunes unhelpful decompositions to improve efficiency, while a de-biasing procedure based on LLM rewriting supports fairer benchmarking by mitigating textual shortcuts. Experiments across multiple datasets and VLMs show consistent gains and competitive performance with far fewer parameters, with especially strong improvements on temporal and causal questions, and robust behavior on de-biased sets. The approach also delivers interpretable reasoning traces, which make it possible to verify the model’s decision paths and grounding quality.
Abstract
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
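The four steps named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the `verify` and `decompose` callables stand in for a video-language verifier and an LLM-based decomposer, a parent's score is taken as the minimum over its children (a simple conjunction rule), and a node is expanded only when the verifier's score lies within a hypothetical `margin` of 0.5.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not the paper's API):
# a verifier maps a statement to an entailment score in [0, 1],
# a decomposer splits a statement into simpler sub-statements.
Verifier = Callable[[str], float]
Decomposer = Callable[[str], List[str]]


@dataclass
class Node:
    statement: str
    score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def build_and_score(statement: str, verify: Verifier, decompose: Decomposer,
                    depth: int = 0, max_depth: int = 2,
                    margin: float = 0.2) -> Node:
    """Verify a statement; expand it only when the verifier is uncertain."""
    node = Node(statement, score=verify(statement))
    confident = abs(node.score - 0.5) >= margin
    if confident or depth >= max_depth:
        return node  # dynamic expansion: skip decompositions that are not needed
    subs = decompose(statement)
    if len(subs) < 2:
        return node  # nothing useful to decompose into
    node.children = [
        build_and_score(s, verify, decompose, depth + 1, max_depth, margin)
        for s in subs
    ]
    # Tree reasoning (assumed rule): a parent holds only if all children hold.
    node.score = min(child.score for child in node.children)
    return node


def answer(options: Dict[str, str], verify: Verifier,
           decompose: Decomposer) -> str:
    """Pick the option whose root statement is best supported."""
    trees = {k: build_and_score(s, verify, decompose) for k, s in options.items()}
    return max(trees, key=lambda k: trees[k].score)


# Toy stand-ins for the verifier and decomposer (made-up scores).
toy_scores = {
    "The man picks up the ball because the dog dropped it": 0.5,
    "The man picks up the ball because he wants to leave": 0.55,
    "The dog dropped the ball": 0.9,
    "The man wants to leave": 0.1,
    "The man picks up the ball": 0.8,
}
toy_parts = {
    "The man picks up the ball because the dog dropped it":
        ["The dog dropped the ball", "The man picks up the ball"],
    "The man picks up the ball because he wants to leave":
        ["The man wants to leave", "The man picks up the ball"],
}
options = {
    "A": "The man picks up the ball because the dog dropped it",
    "B": "The man picks up the ball because he wants to leave",
}
choice = answer(options, toy_scores.__getitem__,
                lambda s: toy_parts.get(s, []))
```

In this toy run both root statements receive uncertain scores, so each is decomposed and the decision rests on the sub-statements. The resulting tree also serves as a readable trace of why an option was chosen or rejected.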
