VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
TL;DR
VideoMind is a video-language agent that emulates human-like reasoning over long videos by decomposing each task into roles coordinated by a Planner: a Grounder for moment localization, a Verifier that assesses the accuracy of the localized interval, and an Answerer for response generation. Its central innovation is Chain-of-LoRA, a lightweight inference-time strategy that switches role-specific LoRA adapters on a single base model to balance efficiency and flexibility. Experiments on 14 benchmarks show state-of-the-art or competitive performance in grounded video QA, video temporal grounding, and general VideoQA, with notable strength on long videos. Ablation studies confirm the value of the planner, grounder, and verifier, and show that Chain-of-LoRA offers substantial efficiency advantages over multi-model or fully fine-tuned baselines.
Abstract
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporally grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adapters while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
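The role-based workflow and adapter switching described above can be pictured with a small toy sketch. This is not the authors' implementation: the class `SharedBackbone`, the function `answer_video_question`, and all `toy_*` role functions are hypothetical names introduced for illustration, and the planning, grounding, and verification rules are placeholder assumptions. The only point is the control flow, in which one shared model object serves every role and a role change merely activates a different lightweight adapter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for one base model with role-specific adapters."""
    adapters: Dict[str, Callable]
    active: Optional[str] = None
    switch_log: List[str] = field(default_factory=list)

    def run(self, role: str, *args):
        # Changing roles only changes which adapter is active;
        # no second model is loaded.
        if role != self.active:
            self.active = role
            self.switch_log.append(role)
        return self.adapters[role](*args)


# Placeholder role behaviours (assumptions, not the paper's logic).
def toy_planner(question: str) -> List[str]:
    # Assumed rule: "when" questions need temporal grounding first.
    if "when" in question.lower():
        return ["grounder", "verifier", "answerer"]
    return ["answerer"]


def toy_grounder(question: str, duration: float) -> List[Span]:
    # Propose two candidate moments inside the video.
    return [(0.25 * duration, 0.5 * duration), (0.5 * duration, duration)]


def toy_verifier(question: str, candidates: List[Span]) -> Span:
    # Assumed scoring: prefer the shortest candidate interval.
    return min(candidates, key=lambda s: s[1] - s[0])


def toy_answerer(question: str, span: Span) -> str:
    return f"answer based on {span[0]:.0f}s-{span[1]:.0f}s"


def answer_video_question(model: SharedBackbone, question: str,
                          duration: float) -> Tuple[str, Span]:
    """Planner picks the roles; the other roles run in sequence."""
    plan = model.run("planner", question)
    span: Span = (0.0, duration)  # default: use the whole video
    if "grounder" in plan:
        candidates = model.run("grounder", question, duration)
        if "verifier" in plan:
            span = model.run("verifier", question, candidates)
        else:
            span = candidates[0]
    return model.run("answerer", question, span), span


def build_toy_model() -> SharedBackbone:
    return SharedBackbone(adapters={
        "planner": toy_planner,
        "grounder": toy_grounder,
        "verifier": toy_verifier,
        "answerer": toy_answerer,
    })
```

Calling `answer_video_question(build_toy_model(), "When does the person open the door?", 120.0)` walks through all four roles on the same object, and `switch_log` records the order in which adapters were activated. In a real system each entry in `adapters` would correspond to a set of LoRA weights applied to the shared base model rather than a Python function.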
