VITED: Video Temporal Evidence Distillation
Yujie Lu, Yale Song, William Wang, Lorenzo Torresani, Tushar Nagarajan
TL;DR
VITED addresses complex, temporally grounded reasoning in long-form VideoQA by generating and distilling temporal evidence chains. It automatically builds an evidence pool from video segments at multiple granularities, refines the pool and searches it for coherent evidence chains, and then trains a temporally aware VLM with a two-stage curriculum to answer questions while producing those chains. The approach yields state-of-the-art or competitive results on six VideoQA benchmarks, improves grounding accuracy, and provides interpretable chain-of-thought explanations. Because evidence generation, grounding, and reasoning are integrated into a single model, VITED's answers can be traced back to specific time spans in the video.
Abstract
We investigate complex video question answering via chain-of-evidence reasoning: identifying sequences of temporal spans from multiple relevant parts of the video, together with the visual evidence within them. Existing models struggle with multi-step reasoning because they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they cannot temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework that enhances existing VideoQA datasets with evidence reasoning chains, constructed automatically by searching the video for intervals of interest, with supporting evidence, that maximize the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it both to localize evidence windows and to perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks, where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
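To make the idea of searching for intervals that maximize answer likelihood concrete, here is a minimal toy sketch. It is not the paper's procedure: the greedy search, the function names `candidate_intervals` and `greedy_chain`, the window sizes, and the scoring function are all assumptions chosen for illustration. In particular, `toy_score` is a stand-in for a model's likelihood of answering the question correctly given the selected intervals.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def candidate_intervals(duration: float, window_sizes: List[float]) -> List[Interval]:
    """Enumerate back-to-back windows at several granularities (an assumed scheme)."""
    intervals: List[Interval] = []
    for size in window_sizes:
        start = 0.0
        while start < duration:
            intervals.append((start, min(start + size, duration)))
            start += size
    return intervals


def greedy_chain(
    candidates: List[Interval],
    score: Callable[[List[Interval]], float],
    max_len: int = 3,
) -> Tuple[List[Interval], float]:
    """Greedily add the interval that most improves the score; stop when none helps."""
    chain: List[Interval] = []
    best = score(chain)
    for _ in range(max_len):
        options = [(score(chain + [c]), c) for c in candidates if c not in chain]
        if not options:
            break
        top_score, top = max(options, key=lambda o: o[0])
        if top_score <= best:
            break
        chain.append(top)
        best = top_score
    return sorted(chain), best


# Toy stand-in for answer likelihood: evidence sits at 12 s and 95 s, and
# shorter chains are mildly preferred via a length penalty.
EVIDENCE_TIMES = [12.0, 95.0]


def toy_score(chain: List[Interval]) -> float:
    covered = sum(any(s <= t < e for s, e in chain) for t in EVIDENCE_TIMES)
    total_len = sum(e - s for s, e in chain)
    return covered / len(EVIDENCE_TIMES) - 0.001 * total_len


candidates = candidate_intervals(duration=120.0, window_sizes=[10.0, 30.0])
chain, best = greedy_chain(candidates, toy_score)
print(chain, round(best, 2))
```

In this toy setting the search selects the two short windows that contain the evidence rather than the coarser windows covering the same moments, which mirrors the motivation above: evidence that is distributed nonuniformly is better captured by targeted intervals than by a fixed uniform sample of frames.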
