Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning

Zhihao He, Tianyao He, Yun Xu, Tieyuan Chen, Huabin Liu, Chaofan Gan, Zuxuan Wu, Weiyao Lin

TL;DR

This work tackles hallucinations in video reasoning caused by spatio-temporal incompleteness within a single video. It introduces a structured multi-video framework comprising a Video Structuring Module (VSM), which converts videos into spatio-temporal graphs, and a Graph Fusion Module (GFM), which integrates information from related videos into graph tokens that are fed to a large language model via a structured multi-video prompt. The approach is validated on multiple video QA benchmarks; ablations show gains from both the VSM and the GFM, and training uses a relatively small dataset (~87K samples). The results indicate improved reliability and accuracy on complex video understanding tasks, and the authors state that the code will be open-sourced.

Abstract

Despite the rapid progress of video language models, comprehensive video reasoning is hindered by the inherent spatio-temporal incompleteness of individual videos, which leads to hallucinations and inaccuracies. A promising solution is to augment reasoning with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the related video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video's knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models. Code will be open-sourced at https://github.com/ziHoHe/SMV-CR.
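
To make the idea of fusing related videos into graph node tokens more concrete, the following is a minimal Python sketch. It is not the authors' implementation. The `VideoGraph` class, the dot-product attention over related-video nodes, the residual addition, and the single round of mean aggregation along edges are all assumptions chosen for illustration; the abstract does not specify how the Graph Fusion Module is built.

```python
import math
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    """Hypothetical toy spatio-temporal graph: node features plus directed edges."""
    features: list                              # node index -> feature vector (list of floats)
    edges: list = field(default_factory=list)   # (src, dst) node-index pairs


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_video_attend(target, related):
    """Add to each target node a softmax-weighted mix of all related-video nodes.

    Assumption: plain dot-product attention with a residual add.
    """
    pool = [f for g in related for f in g.features]
    if not pool:
        return [list(f) for f in target.features]
    fused = []
    for q in target.features:
        weights = softmax([dot(q, k) for k in pool])
        context = [sum(w * k[d] for w, k in zip(weights, pool)) for d in range(len(q))]
        fused.append([x + c for x, c in zip(q, context)])
    return fused


def propagate(features, edges):
    """One round of mean aggregation over each node and its incoming neighbours."""
    out = []
    for i, f in enumerate(features):
        group = [f] + [features[s] for s, d in edges if d == i]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def fuse(target, related):
    """Return augmented node vectors for the target video (stand-ins for graph tokens)."""
    return propagate(cross_video_attend(target, related), target.edges)


if __name__ == "__main__":
    target = VideoGraph([[1.0, 0.0], [0.0, 1.0]], [(0, 1)])
    related = [VideoGraph([[1.0, 0.0]])]
    print(fuse(target, related))
```

In this sketch the output has one vector per target-video node regardless of how many related videos are supplied, which mirrors the motivation stated in the abstract: related videos contribute information without adding their full visual token sequences to the language model input.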

Paper Structure

This paper contains 22 sections, 4 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Video question-answering pipeline under different video collaboration strategies. (a) Single-video reasoning pipeline; (b) Direct multi-video collaboration pipeline: concatenating multiple videos' visual tokens, which is burdensome; (c) Structured multi-video collaboration pipeline (ours).
  • Figure 2: Video question answering examples from video language models. Single-video reasoning (Left): In Video 1, the environmental visual cues are hard for the model to perceive, leading to a 'sports team' hallucination based only on the textual query and linguistic priors. In Video 2, ice-related visuals are missing, and limited bartending knowledge causes the model to skip the question. Multi-video reasoning (Right): Introducing relevant videos allows the model to complete and summarize domain-specific knowledge (such as environmental protection or bartending in this case), leading to more reliable and accurate answers.
  • Figure 3: Multi-video collaborative reasoning framework. Together with the target video, $N$ related videos are retrieved to facilitate the reasoning process. First, we design the Video Structuring Module to obtain the structured video representation. Then, the Graph Fusion Module fuses the structure information and the related videos' information to get the video graph tokens. Finally, according to the designed prompts, the graph tokens, visual tokens, and text tokens are arranged as input to the large language model for question answering.
  • Figure 4: Video captioning prompts. We refer to the design outlined in [zhang2024videoinstructiontuningsynthetic] to create the prompts used to extract captions from videos. The prompts are divided into two parts: the system prompt and the user message. In the system prompt, we define the task of video captioning and provide corresponding guidelines along with a standardized output format. For the output format, the program randomly selects contents in green font as the normalized format for reference during each captioning pass. For the user message, we use <VIDEO_TOKENS> as the placeholder for the video tokens and provide a concise instruction to the model, which then generates a detailed description of the video.
  • Figure 5: Structured multi-video prompts. We properly integrate the multi-modal tokens, together with the prompt guidance, to form an LLM-friendly input.
  • ...and 8 more figures
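
The contrast between pipelines (b) and (c) in Figure 1, and the prompt arrangement described for Figures 3 and 5, can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' prompt template. The `<GRAPH_TOKENS>` placeholder, the line layout, the function names, and every token count are assumptions; only `<VIDEO_TOKENS>` appears in the source (Figure 4).

```python
def direct_concat_tokens(num_related, tokens_per_video, text_tokens):
    """Input length when every video's visual tokens are concatenated (pipeline b)."""
    return (num_related + 1) * tokens_per_video + text_tokens


def structured_tokens(tokens_per_video, graph_tokens, text_tokens):
    """Input length when related videos contribute only fused graph tokens (pipeline c).

    Assumption: the target video keeps its visual tokens and the graph
    tokens form one fixed-size block.
    """
    return tokens_per_video + graph_tokens + text_tokens


def build_prompt(question, video_ph="<VIDEO_TOKENS>", graph_ph="<GRAPH_TOKENS>"):
    """Arrange visual, graph, and text segments into one prompt string.

    The ordering and labels are hypothetical; graph_ph is an invented placeholder.
    """
    return "\n".join([
        f"Video: {video_ph}",
        f"Structured knowledge from the target and related videos: {graph_ph}",
        f"Question: {question}",
    ])


if __name__ == "__main__":
    # All numbers below are made up purely to show the arithmetic.
    print(direct_concat_tokens(num_related=4, tokens_per_video=2000, text_tokens=50))
    print(structured_tokens(tokens_per_video=2000, graph_tokens=64, text_tokens=50))
    print(build_prompt("What is the person preparing?"))
```

With these hypothetical numbers, direct concatenation grows linearly with the number of related videos, while the structured input stays at a fixed size, which is the burden that the Figure 1 caption points to.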