
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, Lei Hou, Juanzi Li

TL;DR

VidCoM presents a tuning-free framework that enables LLMs to reason about long, sparse videos by coupling lightweight visual tools with an instruction-oriented event localization method, InsOVER. By decomposing instructions and video content into sub-events and employing a Hungarian-matching-based refinement, VidCoM achieves efficient cross-modal alignment and strong reasoning performance $p_{\theta}(A|V,L,\mathcal{K})$. Across STAR (VideoQA) and ActivityNet-Captions (DVC), the approach delivers state-of-the-art results in few-shot settings and competitive performance against fully supervised baselines, illustrating practical benefits of reducing training requirements while maintaining robust world-knowledge reasoning. The work highlights the potential of modular multimodal tools and LLMs to jointly handle perception and knowledge-driven reasoning in video understanding, with publicly released code forthcoming.

Abstract

Building models that comprehend videos and respond to specific user instructions is a practical and challenging topic, as it requires mastery of both vision understanding and knowledge reasoning. Compared to the language and image modalities, training efficiency remains a serious problem, as existing studies train models on massive sparse videos paired with brief descriptions. In this paper, we introduce VidCoM, a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools. Specifically, we reveal that the key to responding to specific instructions is focusing on relevant video events, and we utilize two visual tools, structured scene graph generation and descriptive image caption generation, to gather and represent the event information. An LLM enriched with world knowledge is then adopted as the reasoning agent, producing responses by performing multiple reasoning steps on specific video events. To address the difficulty LLMs have in identifying video events, we further propose an Instruction-oriented Video Events Recognition (InsOVER) algorithm. This algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events, thereby enabling LLMs to interact effectively with extended videos. Extensive experiments on two typical video comprehension tasks show that the proposed tuning-free framework outperforms pre-trained models, including Flamingo-80B, and achieves state-of-the-art performance. Our source code and system will be publicly available.
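
The matching step described in the abstract can be illustrated with a small sketch. This is not the authors' InsOVER implementation: the function names, the token-overlap similarity, and the example strings are all hypothetical, and exhaustive search over permutations stands in for the Hungarian algorithm (both return an optimal one-to-one assignment, but the Hungarian algorithm does so in polynomial time). The sketch assigns each instruction assertion to a distinct frame caption so that total similarity is maximized.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    Assumption: a simple stand-in for whatever cross-modal score a real
    system would use; it is not taken from the paper.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_assertions_to_frames(assertions, frame_captions):
    """Return (assignment, total_similarity).

    assignment[i] is the index of the frame caption matched to assertion i.
    Brute force over permutations; fine for a handful of items only.
    """
    if len(assertions) > len(frame_captions):
        raise ValueError("need at least as many frames as assertions")
    sim = [[jaccard(a, f) for f in frame_captions] for a in assertions]
    best, best_score = None, -1.0
    for perm in permutations(range(len(frame_captions)), len(assertions)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score


# Hypothetical decomposition of an instruction and per-frame captions.
assertions = ["person opens the fridge", "person drinks water"]
frame_captions = [
    "a person drinks water from a glass",
    "a dog sleeps on the sofa",
    "a person opens the fridge door",
]
assignment, score = match_assertions_to_frames(assertions, frame_captions)
print(assignment)  # → [2, 0]
```

Here the first assertion is matched to the third caption and the second assertion to the first caption, while the unrelated caption is left unmatched. For larger inputs, a polynomial-time solver for the assignment problem would replace the permutation loop.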

Paper Structure (26 sections, 5 equations, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: Average results on 100 randomly selected video-caption pairs from ActivityNet-Captions. Left: counts of videos by average similarity of local frames. Right: the frequencies of words in captions.
  • Figure 2: Illustration of the VidCoM process with a DVC example. Given a user instruction requesting event regions with captions, the InsOVER S-1 algorithm is first applied to initialize $n$ events. Then, $T$ reasoning steps of the LLM agent are performed on the video events, based on InsOVER S-2, to produce the final response.
  • Figure 3: Illustration of the InsOVER algorithm, where $stage_1$ initializes $3$ events automatically, and $stage_2$ refines the events based on bipartite-graph matching between frames and assertions extracted by an OpenIE model.
  • Figure 4: Ablation studies with various numbers of demonstrations and frames on STAR.
  • Figure 5: A case study of VidCoM on STAR.
  • ...and 2 more figures