VideoPro: Adaptive Program Reasoning for Long Video Understanding

Chenglin Li, Feng Han, Yikun Wang, Ruilin Li, Shuai Dong, Haowen Hou, Haitao Li, Qianglong Chen, Feng Tao, Jingqi Tong, Yin Zhang, Jiaqi Wang

TL;DR

VideoPro tackles long-form video understanding by combining adaptive reasoning and self-refinement to balance speed and accuracy. It employs a modular video toolset to ground queries with selective, multi-step visual programming only when necessary, and uses a confidence-guided refinement loop to repair failures. The two-stage training with SFT and GRPO enables learnable routing between native VideoLLM reasoning and program-based workflows, achieving strong results on LVBench, VideoMME-L, and other benchmarks. This approach offers practical benefits for scalable, reliable multi-modal reasoning in long videos, with potential impact on real-world video QA systems.

Abstract

Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we use a strong LLM to construct a diverse, high-quality fast-slow reasoning dataset and use it to align an open-source language model, which we call FS-LLM, to generate visual program workflows. Next, we design a fast-slow reasoning framework with FS-LLM, motivated by human-like reasoning processes: simple queries are solved directly by VideoLLMs, while difficult ones invoke visual program reasoning. During this process, a low-confidence fast-thinking answer triggers a second-stage slow-reasoning process, and a fallback to fast reasoning is activated if program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. Adjusting the parameters of the visual modules within a program generates multiple variants: during training, the programs that yield correct answers are selected, while during inference, the program whose result has the highest confidence is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, and matches the performance of Qwen2.5VL-72B on VideoMME.
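
The control flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Answer`, `answer_query`, `fast_solver`, and `program_variants`, as well as the threshold value of 0.75, are hypothetical and chosen only for illustration. The sketch also simplifies the routing by always trying the fast path first, whereas the paper lets FS-LLM decide which path a query takes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def answer_query(
    fast_solver: Callable[[], Answer],
    program_variants: Iterable[Callable[[], Answer]],
    threshold: float = 0.75,  # hypothetical value, not taken from the paper
) -> Answer:
    """Fast-slow routing sketch with confidence-based selection and fallback."""
    # Fast path: ask the VideoLLM directly.
    fast = fast_solver()
    if fast.confidence >= threshold:
        return fast

    # Slow path: run each parameter variant of the visual program and
    # keep the result with the highest confidence.
    best = None
    for run_variant in program_variants:
        try:
            result = run_variant()
        except Exception:
            # A variant whose execution fails is skipped.
            continue
        if best is None or result.confidence > best.confidence:
            best = result

    # Fallback: if every variant failed, return the fast answer.
    return best if best is not None else fast
```

In this sketch each program variant is a zero-argument callable, so the same loop works whether the variants differ in clip count, retrieval window, or any other module parameter.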

Paper Structure

This paper contains 32 sections, 5 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Comparison of prior methods and VideoPro: effective and reliable adaptive reasoning with refinement.
  • Figure 2: Distribution of correct vs. error predictions across confidence on LongVideoBench. The proportion of correct predictions exceeds errors in $[0.7, 0.8)$ interval, and exceeds $90\%$ when confidence is above $0.9$.
  • Figure 3: (a) Adaptive Reasoning & Self-Refinement: VideoPro dynamically selects between Native VideoLLM and Multi-step visual program reasoning based on query complexity. Self-refinement is employed to correct failed executions and low-confidence reasoning programs. (b) Training Pipeline: The process involves (1) SFT on the reason-and-refine dataset, and (2) GRPO to optimize rewards for correctness, format validity, and consistency.
  • Figure 4: Performance at varying confidence thresholds. VideoPro exhibits robust performance on the Long Video Benchmark across the wide interval of $[0.4, 0.9]$.
  • Figure 5: Accuracy on LongVideoBench and VideoMME across different video durations.
  • ...and 9 more figures
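
A confidence analysis of the kind summarized in the Figure 2 caption can be computed by grouping predictions into confidence bins and measuring accuracy within each bin. The sketch below is an assumption about how such a breakdown might be produced, not the authors' code; the function name `accuracy_by_confidence` and the sample records are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_confidence(
    records: Iterable[Tuple[float, bool]],
    n_bins: int = 10,
) -> Dict[int, float]:
    """Map each confidence bin index to the fraction of correct predictions.

    Bin i covers [i / n_bins, (i + 1) / n_bins); a confidence of exactly 1.0
    is placed in the last bin.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in records:
        # The small epsilon guards against floating-point error at bin edges.
        idx = min(int(confidence * n_bins + 1e-9), n_bins - 1)
        totals[idx] += 1
        hits[idx] += int(is_correct)
    return {idx: hits[idx] / totals[idx] for idx in sorted(totals)}


# Toy records (confidence, is_correct), invented for illustration only.
toy_records = [
    (0.95, True), (0.92, True), (0.91, False), (1.0, True),
    (0.75, True), (0.72, False),
]
per_bin = accuracy_by_confidence(toy_records)
```

A breakdown like this is one way to choose the confidence threshold at which a fast answer is accepted rather than passed on to refinement.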