A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

Yue Zhang, Liqiang Jing, Jia Li, Yapeng Tian, Xinya Du, Yunhui Guo, Vibhav Gogate

Abstract

Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.

Paper Structure (40 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Statistics of our benchmark. The benchmark includes 1,442 samples across 11 tasks, spanning low-level perception, mid-level cross-video matching, and high-level reasoning.
  • Figure 2: Overview of the SAMA framework. A text-only LLM planner, guided by task-adaptive skills that define when, how, and in what order to invoke tools, orchestrates eight visual tools across three groups: Perception, Detection, and Other tools. Skills determine the invocation strategy based on task type — for example, similarity tasks prioritize Visual Similarity before Video Reader, while counting tasks consult both Video Reader and Scene Graph to enable cross-modal conflict detection. Detected conflicts trigger adaptive re-reading, feeding corrective information back to the planner before the final answer is produced.
  • Figure 3: Overview of MVX-Bench.
  • Figure 4: Case study of the conflict detection mechanism.
  • Figure 5: Case study of skill-guided tool selection.
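
The control flow described in the Figure 2 caption (skills fix the tool order per task type, and disagreement between tools triggers a re-read) can be illustrated with a minimal sketch. This is not the authors' implementation: the skill table, the tool names as Python keys, the `video_reader_detailed` re-read tool, and the functions `plan_tools` and `run_counting` are all hypothetical, and the tools are stubbed as plain callables that return a count.

```python
from typing import Callable, Dict, List

# Hypothetical skill table (assumption): task type -> ordered tool names.
# The two orderings mirror the examples given in the Figure 2 caption.
SKILLS: Dict[str, List[str]] = {
    "similarity": ["visual_similarity", "video_reader"],
    "counting": ["video_reader", "scene_graph"],
}

# A tool is modeled as a callable from a video path to an integer count.
Tool = Callable[[str], int]


def plan_tools(task_type: str) -> List[str]:
    """Return the tool order that the skill for this task type prescribes."""
    return list(SKILLS[task_type])


def run_counting(video: str, tools: Dict[str, Tool], max_rereads: int = 1) -> dict:
    """Query both counting tools and re-read the video while they disagree."""
    order = plan_tools("counting")
    results = {name: tools[name](video) for name in order}

    rereads = 0
    # A conflict here means the tools report different counts.
    while len(set(results.values())) > 1 and rereads < max_rereads:
        rereads += 1
        # Assumed re-read strategy: ask a more detailed reader to look again.
        results["video_reader"] = tools["video_reader_detailed"](video)

    return {
        "answer": results["video_reader"],
        "rereads": rereads,
        "unresolved_conflict": len(set(results.values())) > 1,
    }


def demo_tools() -> Dict[str, Tool]:
    """Stub tools with fixed outputs, used only to exercise the control flow."""
    return {
        "video_reader": lambda video: 3,
        "scene_graph": lambda video: 4,
        "video_reader_detailed": lambda video: 4,
    }


if __name__ == "__main__":
    print(plan_tools("similarity"))
    print(run_counting("clip.mp4", demo_tools()))
```

With the stub tools, the first pass yields conflicting counts, so a single re-read is issued before an answer is returned. In a real system the planner would be an LLM and the tools would be vision models; only the ordering and the conflict check are sketched here.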