Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability
Zhaoyu Chen, Hongnan Lin, Yongwei Nie, Fei Ma, Xuemiao Xu, Fei Yu, Chengjiang Long
TL;DR
This work tackles temporal video grounding (TVG) by showing that IoU-centric optimization can erode action understanding. It introduces Invert4TVG, a framework that couples TVG with three inversion tasks (Verb Completion, Action Recognition, and Video Description) through reinforcement learning built on Group Relative Policy Optimization. The inversion tasks provide semantic supervision derived from the TVG data itself, yielding stronger action-verb alignment and improved grounding accuracy. The method reports state-of-the-art results on Charades-STA (e.g., the 7B model reaches an R1@0.7 of 51.4) and strong zero-shot performance on ActivityNet and QVHighlights. Ablations show that training the inversion tasks jointly helps, that a 20% inversion rate works best, and that a binary reward outperforms a cosine-based one for the Invert-TVG tasks. Overall, Invert4TVG combines action understanding and temporal localization in LVLM-based TVG without sacrificing efficiency, and points toward multi-task reinforcement learning for video-language understanding.
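To make the task mixing and the binary reward concrete, here is a minimal sketch. It assumes that each training sample is switched to an inversion task with a fixed probability (the 20% rate from the ablation), that the three inversion tasks are chosen uniformly, and that the binary reward is a case-insensitive exact match on the verb. The names `sample_task`, `binary_verb_reward`, and `INVERSION_TASKS` are hypothetical, and the paper's actual sampling scheme and reward definitions may differ.

```python
import random

# Hypothetical task labels; the paper names the tasks but not these identifiers.
INVERSION_TASKS = ["verb_completion", "action_recognition", "video_description"]


def sample_task(inversion_rate=0.2, rng=random):
    """Pick the task for one training sample.

    With probability `inversion_rate` an inversion task is chosen
    (uniformly among the three, which is an assumption); otherwise
    the sample is used for ordinary TVG.
    """
    if rng.random() < inversion_rate:
        return rng.choice(INVERSION_TASKS)
    return "tvg"


def binary_verb_reward(predicted, target):
    """Return 1.0 if the predicted verb matches the target, else 0.0.

    Matching is case-insensitive and ignores surrounding whitespace;
    this exact-match rule is an assumption made for illustration.
    """
    return 1.0 if predicted.strip().lower() == target.strip().lower() else 0.0
```

A binary reward of this kind gives no partial credit for near-synonyms, in contrast to a cosine-similarity reward computed over embeddings.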
Abstract
Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, which usually optimize for high temporal Intersection-over-Union (IoU), frequently struggle to recognize or understand the underlying actions in both the video and the query, which limits their effectiveness. To address this, we propose a novel TVG framework that integrates inversion-based TVG tasks as auxiliary objectives to maintain the model's action understanding ability. We introduce three kinds of inversion TVG tasks: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions given video segments. These inversion tasks are derived entirely from the original TVG annotations and are probabilistically integrated with the TVG task within a reinforcement learning framework. By leveraging carefully designed reward functions, the model preserves its ability to understand actions, thereby improving the accuracy of temporal grounding. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model.
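For readers unfamiliar with the metric, the sketch below shows how temporal IoU and R1@0.7 are conventionally computed: a query counts as a hit when the IoU between the top-1 predicted segment and the ground-truth segment is at least 0.7. The function names and the `(start, end)` tuple representation are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold=0.7):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3, which would not count as a hit at the 0.7 threshold.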
