Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
TL;DR
Video-STAR tackles open-vocabulary action recognition by addressing two core challenges: cross-modal hallucinations from text-centric reasoning and difficulty distinguishing semantically similar actions. It introduces a unified framework that decomposes actions into discriminative sub-motions and dynamically utilizes domain-specific tools within a reinforcement learning loop (GRPO) guided by a hierarchical reward. The training pipeline combines multimodal CoT data generation, agentic supervised fine-tuning, and tool-aware RL to learn perception-grounded, category-specific reasoning. Empirical results across HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 show state-of-the-art performance and strong generalization, with detailed ablations validating the contributions of sub-motion decomposition and tool usage. The approach promises robust, scalable open-vocabulary video understanding with practical implications for surveillance, human-robot interaction, and content analysis.
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that combines contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate state-of-the-art performance: our method outperforms existing methods in distinguishing fine-grained actions and in handling cross-modal hallucination, supporting its robustness and generalization.
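To make the reward design concrete, the following is a minimal sketch of how a hierarchical reward could feed group-relative advantages in a GRPO-style update. It is an illustration under stated assumptions, not the paper's implementation: the `Rollout` fields, the gating order, the weights, and the tool-call budget are all hypothetical. Only the three reward ingredients (tool-usage efficiency, sub-motion relevance, structural coherence) and the use of GRPO come from the text above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled reasoning trace for a video (all fields are hypothetical)."""
    correct: bool           # predicted action label matches the ground truth
    well_formed: bool       # reasoning follows the expected output structure
    submotion_score: float  # relevance of the cited sub-motions, in [0, 1]
    tool_calls: int         # number of tool invocations in the trace


def hierarchical_reward(r: Rollout, max_tool_calls: int = 3,
                        w_format: float = 0.2, w_submotion: float = 0.3,
                        w_tool: float = 0.2) -> float:
    """Assumed hierarchy: structure gates everything, accuracy gates the bonuses."""
    if not r.well_formed:
        return 0.0                      # malformed traces earn nothing
    reward = w_format                   # structural coherence
    if r.correct:
        reward += 1.0                   # accuracy
        reward += w_submotion * r.submotion_score   # sub-motion relevance
        if 0 < r.tool_calls <= max_tool_calls:
            reward += w_tool            # tools used, and within the budget
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization: standardize rewards within one sampled group.

    Population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Usage: score a group of rollouts sampled for the same video and prompt.
group = [
    Rollout(correct=True, well_formed=True, submotion_score=1.0, tool_calls=1),
    Rollout(correct=True, well_formed=True, submotion_score=0.5, tool_calls=0),
    Rollout(correct=False, well_formed=True, submotion_score=0.0, tool_calls=2),
    Rollout(correct=True, well_formed=False, submotion_score=1.0, tool_calls=1),
]
rewards = [hierarchical_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Gating the sub-motion and tool bonuses on a correct answer is one possible way to keep the policy from collecting auxiliary rewards while predicting the wrong action; a different ordering or an additive scheme would be equally compatible with the description in the abstract.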
