VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou

TL;DR

VideoMind introduces a video-language agent that emulates human long-form video reasoning by decomposing a task into roles coordinated by a Planner: a Grounder for moment localization, a Verifier for assessing the accuracy of the localized intervals, and an Answerer for response generation. Its core innovation is Chain-of-LoRA, a lightweight inference-time strategy that switches role-specific LoRA adapters on a single base model to balance efficiency and flexibility. Experiments across 14 benchmarks show state-of-the-art or competitive performance in grounded video QA, video temporal grounding, and general VideoQA, with notable strength on long videos. Ablation studies confirm the value of the planner, grounder, and verifier, and demonstrate that Chain-of-LoRA offers substantial efficiency advantages over multi-model or fully fine-tuned baselines.

Abstract

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.
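
The role-switching idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: `SharedBackbone`, `run_chain`, and the four toy role functions are hypothetical names, each "adapter" is a plain Python function standing in for a LoRA adapter, and the planner here always requests the full Grounder, Verifier, Answerer chain. The point is only that a single backbone object is reused while the active role changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedBackbone:
    """Stand-in for one base model holding several role-specific adapters."""
    adapters: Dict[str, Callable[[dict], dict]]
    active: str = ""
    switch_log: List[str] = field(default_factory=list)

    def switch(self, role: str) -> None:
        # Assumption: switching roles only changes which adapter is active.
        if role not in self.adapters:
            raise KeyError(f"no adapter registered for role {role!r}")
        self.active = role
        self.switch_log.append(role)

    def run(self, state: dict) -> dict:
        return self.adapters[self.active](state)


def plan_roles(state: dict) -> dict:
    # Toy planner: always asks for the full chain.
    return {**state, "plan": ["grounder", "verifier", "answerer"]}


def ground(state: dict) -> dict:
    # Toy grounder: proposes fixed candidate (start, end) spans in seconds.
    return {**state, "candidates": [(10.0, 20.0), (95.0, 120.0)]}


def verify(state: dict) -> dict:
    # Toy verifier: keeps the longest candidate. A real verifier would
    # score each candidate with the model instead.
    best = max(state["candidates"], key=lambda span: span[1] - span[0])
    return {**state, "moment": best}


def answer(state: dict) -> dict:
    start, end = state["moment"]
    return {**state, "answer": f"answered from {start:.0f}s-{end:.0f}s"}


def run_chain(model: SharedBackbone, video: str, question: str) -> dict:
    """Run the planner, then each planned role, on the same backbone."""
    state = {"video": video, "question": question}
    model.switch("planner")
    state = model.run(state)
    for role in state["plan"]:
        model.switch(role)
        state = model.run(state)
    return state


model = SharedBackbone(adapters={
    "planner": plan_roles,
    "grounder": ground,
    "verifier": verify,
    "answerer": answer,
})
result = run_chain(model, "video.mp4", "What happens after the goal?")
print(model.switch_log)
print(result["answer"])
```

Keeping the roles behind one `switch` call mirrors the stated motivation: the cost of adding a role is one more adapter entry, not one more model.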

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 19 tables, 1 algorithm.

Figures (6)

  • Figure 1: An illustration of VideoMind’s Chain-of-LoRA reasoning strategy applied to a complex question for a 50-min long video. The problem is decomposed by Planner and distributed to Grounder, Verifier, and Answerer to systematically localize, verify, and interpret the relevant video moments. Such a role-based pipeline enables more human-like video reasoning compared with the pure textual CoT process.
  • Figure 2: The overall workflow of VideoMind. Given a video and a query, VideoMind adaptively activates different roles (Planner $\to$ Grounder $\to$ Verifier $\to$ Answerer in this case) and performs step-by-step reasoning by calling individual modules.
  • Figure 3: The Planner coordinates other roles based on the query, providing three modes and rephrasing tailored for different needs.
  • Figure 4: Detailed architecture of the timestamp decoder. This module accepts hidden states of both the frame tokens and the <REG> token, decoding them into the start and end timestamps.
  • Figure 5: The grounder generates multiple candidate moments, which are then refined by applying a zoom-in strategy and evaluated by Verifier to select the best moment.
  • ...and 1 more figures
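
The candidate-selection step described in the Figure 5 caption can be sketched as follows. This is one plausible reading, not the paper's procedure: the captions do not specify how the zoom-in window is built, so the `margin` parameter (widening each candidate by a fraction of its length on both sides) and the `score` callback standing in for the Verifier are assumptions, and `zoom_window` and `select_moment` are hypothetical names.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def zoom_window(span: Span, duration: float, margin: float = 0.5) -> Span:
    """Widen a candidate by `margin` of its length per side, clamped to the video."""
    start, end = span
    pad = (end - start) * margin  # assumed padding rule
    return (max(0.0, start - pad), min(duration, end + pad))


def select_moment(
    candidates: List[Span],
    duration: float,
    score: Callable[[Span, Span], float],
) -> Span:
    """Return the candidate whose zoomed-in window gets the highest score."""
    return max(candidates, key=lambda c: score(c, zoom_window(c, duration)))


# Toy scorer (assumption): prefer candidates starting near the 50 s mark.
def toy_score(candidate: Span, window: Span) -> float:
    return -abs(candidate[0] - 50.0)


best = select_moment([(10.0, 20.0), (45.0, 60.0)], duration=100.0, score=toy_score)
print(best)
```

Passing the scorer in as a callback keeps the selection logic independent of how the Verifier actually judges a window.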