Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

TL;DR

Video-STAR tackles open-vocabulary action recognition by addressing two core challenges: cross-modal hallucinations from text-centric reasoning and difficulty distinguishing semantically similar actions. It introduces a unified framework that decomposes actions into discriminative sub-motions and dynamically utilizes domain-specific tools within a reinforcement learning loop (GRPO) guided by a hierarchical reward. The training pipeline combines multimodal CoT data generation, agentic supervised fine-tuning, and tool-aware RL to learn perception-grounded, category-specific reasoning. Empirical results across HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 show state-of-the-art performance and strong generalization, with detailed ablations validating the contributions of sub-motion decomposition and tool usage. The approach promises robust, scalable open-vocabulary video understanding with practical implications for surveillance, human-robot interaction, and content analysis.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, which validates the method's robustness and generalization.

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Key insight of Video-STAR. (a) MLLMs + CoT is prone to hallucinations due to over-reliance on text-centric reasoning while ignoring visual cues. (b) MLLMs + Tool-Augmented CoT mitigates hallucinations by integrating domain-specific tools to extract visual information. However, both (a) and (b) lack category-specific reasoning capabilities and struggle to distinguish semantically similar or complex actions. (c) Video-STAR enhances reasoning capacity by introducing contextual sub-motion decomposition, which disentangles actions into discriminative motion primitives. This enables fine-grained action discrimination and robust performance in open-vocabulary scenarios.
  • Figure 2: Pipeline of Video-STAR. (i) Introduce a three-stage sub-motion logic chain to construct tool-augmented reasoning data that decomposes actions into discriminative sub-motions. (ii) Pre-train the MLLM on structured reasoning chains and fine-tune it for domain-specific adaptation. (iii) Adopt the GRPO algorithm for reinforcement learning, optimizing a hierarchical reward function that accounts for both tool usage and sub-motions to ensure robust and consistent inference.
  • Figure 3: Tool Library. Given the input video, Video-STAR uses YOLO 11 for human detection and pose estimation, and the Qwen API for action explanation and video description.
  • Figure 4: Case study comparing Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies the action "turn" as "smile", while our Video-STAR-3B identifies the correct action.
  • Figure 5: Additional case study comparing Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies "push" as "play", while our Video-STAR-3B identifies the correct action.
  • ...and 2 more figures
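
Step (iii) of the pipeline in Figure 2 pairs GRPO with a hierarchical reward. The sketch below illustrates how such a reward could feed GRPO's group-relative advantage, which normalizes each rollout's reward against the mean and standard deviation of its sampling group. Only the three reward ingredients (structural coherence, sub-motion relevance, tool-usage efficiency) come from the abstract. The `Rollout` fields, the gating order, the weights 0.2/0.5/0.3, the `max_useful_calls` budget, and the use of the population standard deviation are all illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled response for a video (hypothetical fields)."""
    format_ok: bool          # reasoning follows the expected structure
    submotion_score: float   # sub-motion relevance in [0, 1]
    tool_calls: int          # number of tool invocations


def hierarchical_reward(r: Rollout, max_useful_calls: int = 2) -> float:
    """Assumed reward: a format gate first, then additive bonuses."""
    # Level 1 (assumption): malformed reasoning earns nothing.
    if not r.format_ok:
        return 0.0
    reward = 0.2  # structural coherence
    # Level 2 (assumption): credit proportional to sub-motion relevance.
    reward += 0.5 * r.submotion_score
    # Level 3 (assumption): bonus only for tool use within a small budget.
    if 0 < r.tool_calls <= max_useful_calls:
        reward += 0.3
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice here
    return [(x - mu) / (sigma + eps) for x in rewards]


# A group of rollouts sampled for the same input video.
group = [
    Rollout(format_ok=True, submotion_score=1.0, tool_calls=1),
    Rollout(format_ok=True, submotion_score=0.0, tool_calls=0),
    Rollout(format_ok=False, submotion_score=1.0, tool_calls=1),
]
rewards = [hierarchical_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group, a rollout is reinforced only when it scores above its siblings for the same video. If every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal.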