Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
Zhihao He, Tianyao He, Yun Xu, Tieyuan Chen, Huabin Liu, Chaofan Gan, Zuxuan Wu, Weiyao Lin
TL;DR
This work tackles hallucinations in video reasoning that arise from the spatio-temporal incompleteness of a single video. It introduces a structured multi-video framework with two components: a Video Structuring Module (VSM) that converts each video into a spatio-temporal graph, and a Graph Fusion Module (GFM) that fuses information from related videos into graph tokens, which are fed to a large language model through a structured multi-video prompt. The approach is evaluated on multiple video QA benchmarks. Ablations show gains from both the VSM and the GFM, and training uses a relatively small dataset (~87K samples). The results indicate improved reliability and accuracy on complex video understanding tasks, and the authors state that the code will be open-sourced.
Abstract
Despite rapid progress in video language models, comprehensive video reasoning is still hindered by the inherent spatio-temporal incompleteness of individual videos, which leads to hallucinations and inaccuracies. A promising solution is to augment reasoning with multiple related videos. However, video tokens are numerous and highly redundant, so directly feeding the related video data into a large language model could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module that represents a video's knowledge as a spatio-temporal graph. Building on this structured representation, we design a Graph Fusion Module that fuses the structured knowledge and valuable information from related videos into augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt that integrates the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework and show its potential as a promising avenue for advancing video language models. Code will be open-sourced at https://github.com/ziHoHe/SMV-CR.
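
To make the two-stage idea concrete, the following is a minimal sketch of structuring a video as a spatio-temporal graph and then enriching its node features with information from related videos. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `VideoGraph`, `build_graph`, and `fuse` are hypothetical, the edge rules (same-frame objects are spatially linked, repeated labels are temporally linked) are a simplification, and the similarity-weighted residual fusion stands in for whatever learned fusion the Graph Fusion Module actually uses.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    frame: int   # index of the frame the entity was detected in
    label: str   # entity label, e.g. "person"
    feat: list   # feature vector (list of floats)


@dataclass
class VideoGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src, dst, kind) triples


def build_graph(detections):
    """Build a graph from (frame, label, feat) tuples sorted by frame.

    Assumed edge rules (hypothetical, for illustration only):
    - "spatial": two entities detected in the same frame
    - "temporal": the same label seen again at a later position
    """
    g = VideoGraph([Node(f, l, list(v)) for f, l, v in detections])
    last_seen = {}
    for i, n in enumerate(g.nodes):
        for j in range(i):
            if g.nodes[j].frame == n.frame:
                g.edges.append((j, i, "spatial"))
        if n.label in last_seen:
            g.edges.append((last_seen[n.label], i, "temporal"))
        last_seen[n.label] = i
    return g


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse(target, related, temperature=0.1):
    """Return one augmented feature per target node.

    Each target node attends over all nodes of the related graphs with
    softmax-normalised cosine similarity and adds the weighted context
    to its own feature (a residual update). All features are assumed to
    share one dimensionality.
    """
    pool = [n.feat for g in related for n in g.nodes]
    fused = []
    for n in target.nodes:
        if not pool:
            fused.append(list(n.feat))  # nothing to fuse, keep as is
            continue
        scores = [cosine(n.feat, p) / temperature for p in pool]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        ctx = [sum(wi * p[d] for wi, p in zip(w, pool)) / z
               for d in range(len(n.feat))]
        fused.append([a + c for a, c in zip(n.feat, ctx)])
    return fused
```

In this sketch the fused node features play the role of the augmented graph node tokens. In a full system they would be projected into the language model's embedding space and combined with visual and textual tokens in the prompt, a step not shown here.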
