Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks Preserving Action Understanding Ability

Zhaoyu Chen, Hongnan Lin, Yongwei Nie, Fei Ma, Xuemiao Xu, Fei Yu, Chengjiang Long

TL;DR

This work tackles temporal video grounding (TVG) by showing that IoU-centric optimization can erode action understanding. It introduces Invert4TVG, a framework that couples TVG with three inversion tasks (Verb Completion, Action Recognition, and Video Description) through reinforcement learning built on Group Relative Policy Optimization. The inversion tasks provide semantic supervision aligned with the TVG data, yielding stronger action-verb alignment and improved grounding accuracy: the 7B model reaches an R1@0.7 of 51.4 on Charades-STA, reported as state of the art, alongside strong zero-shot performance on ActivityNet and QvHighlight. Ablation analyses show that the inversion tasks work best when trained jointly, that an inversion rate of 20% is optimal, and that a binary Invert-TVG reward outperforms a cosine-based one. Overall, Invert4TVG unifies action understanding and temporal localization in LVLM-powered TVG without sacrificing efficiency, and points to multi-task reinforcement learning as a direction for video-language understanding.
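
The contrast between binary and cosine-based rewards mentioned above can be illustrated with a minimal sketch. The summary does not give the exact reward formulas, so everything below is an assumption: the binary reward is taken to be a whole-word match on the target verb, and the cosine reward is taken to be the cosine similarity between two embedding vectors supplied by some hypothetical text encoder. The function names `binary_verb_reward` and `cosine_reward` are illustrative only.

```python
import math
from typing import Sequence


def binary_verb_reward(predicted: str, target_verb: str) -> float:
    """Assumed binary reward: 1.0 if the target verb appears as a word, else 0.0."""
    tokens = {t.strip(".,!?").lower() for t in predicted.split()}
    return 1.0 if target_verb.lower() in tokens else 0.0


def cosine_reward(pred_vec: Sequence[float], target_vec: Sequence[float]) -> float:
    """Assumed graded reward: cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(pred_vec, target_vec))
    norm = math.sqrt(sum(a * a for a in pred_vec)) * math.sqrt(
        sum(b * b for b in target_vec)
    )
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    # The binary signal is all-or-nothing on the verb itself.
    print(binary_verb_reward("The person opens the door.", "opens"))
    print(binary_verb_reward("The person closes the door.", "opens"))
    # The cosine signal is graded; the vectors here are toy stand-ins for embeddings.
    print(cosine_reward([1.0, 0.0], [1.0, 1.0]))
```

The sketch only shows how the two signals differ in shape (all-or-nothing versus graded); the finding that the binary variant works better comes from the paper's ablation, not from this example.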

Abstract

Temporal Video Grounding (TVG) aims to localize video segments corresponding to a given textual query, which often describes human actions. However, we observe that current methods, which usually optimize for high temporal Intersection-over-Union (IoU), frequently struggle to recognize or understand the underlying actions in both the video and the query, which limits their effectiveness. To address this, we propose a novel TVG framework that integrates inversion-based TVG tasks as auxiliary objectives to maintain the model's action understanding ability. We introduce three kinds of inversion TVG tasks derived from the original TVG annotations: (1) Verb Completion, predicting masked verbs (actions) in queries given video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions containing query-relevant actions given video segments. These inversion tasks are derived entirely from the original TVG tasks and are probabilistically integrated with them within a reinforcement learning framework. By leveraging carefully designed reward functions, the model preserves its ability to understand actions, thereby improving the accuracy of temporal grounding. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model.
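
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of temporal IoU and of R1@m, under the common convention that a query counts as a hit when its top-1 predicted segment has IoU of at least m with the ground truth. The `Segment` alias and the function names are illustrative, not taken from the paper.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-Union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Fraction of queries whose top-1 prediction has IoU >= threshold."""
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(2.0, 10.0), (0.0, 5.0)]
    gts = [(3.0, 10.0), (4.0, 9.0)]
    print(temporal_iou(preds[0], gts[0]))  # 0.875
    print(recall_at_1(preds, gts, 0.7))    # 0.5
```

In the toy data, the first prediction overlaps the ground truth for 7 of 8 seconds of their union and passes the 0.7 threshold, while the second overlaps for only 1 of 9 seconds and does not.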

Paper Structure

This paper contains 17 sections, 3 theorems, 15 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

The Invert-TVG task minimizes a semantic loss $L_{\text{sem}} = \mathbb{E}_{(V,\tau) \sim D} [d(q', q)]$, where $d(\cdot, \cdot)$ is a distance metric (e.g., verb matching or KL divergence on embeddings). Then, the joint loss satisfies $L_{\text{joint}} \leq L_{\text{TVG}} + C$ for some constant $C

Figures (14)

  • Figure 1: Left side: A specific example of temporal video grounding. The model's reasoning process shows that our method understands the actions in the video better than VideoChat-R1 and Time-R1. Right side: Statistical results demonstrating that Time-R1, which is optimized solely for the IoU loss, reduces action understanding accuracy (VC, AR, and VD are the three proposed auxiliary Invert-TVG tasks, which measure action understanding at multiple granularities). By introducing Invert-TVG tasks, our method preserves action understanding ability and thus boosts TVG ability (as shown in R1@0.3, R1@0.5, and R1@0.7). The baseline is Qwen-2.5-VL-3B.
  • Figure 2: We propose three Invert-TVG tasks. By partially reversing the inputs and outputs of the TVG task, we obtain Verb Completion, Action Recognition, and Video Description, which reuse the original TVG dataset by taking ground-truth video segments as input to reconstruct the query-related target actions. The prompts for the three Invert-TVG tasks are not identical. For VC, the verb in the query is removed, and the model is required to fill in this verb. AR asks the model to directly estimate the verb in the video. VD requires the model to describe the video content using the action verbs in the query.
  • Figure 3: Overview of the proposed Invert4TVG framework. The LVLM dynamically chooses between TVG tasks and Invert-TVG tasks according to different probabilities. Whenever an Invert-TVG task is selected, one of the three variants VC, AR or VD is chosen with equal probability.
  • Figure 4: Performance of temporal video grounding on ActivityNet and QvHighlight. We compare our method with Time-R1 (the best-performing among previous methods). All the models are zero-shot tested.
  • Figure 5: The R1 accuracy curves. Blue, orange, and green show how the three R1 metrics evolve as the Invert-TVG task probability $(1-p)$ gradually increases.
  • ...and 9 more figures
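
The task selection described in the Figure 3 caption, together with the verb removal described for VC in Figure 2, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: `p_tvg` stands for the TVG probability $p$ (so $1-p$ is the Invert-TVG probability, as in Figure 5), the `[MASK]` token and the helper names `sample_task` and `mask_verb` are hypothetical, and the verb to mask is assumed to be known already (for example, from a part-of-speech tagger).

```python
import random

INVERT_TASKS = ("VC", "AR", "VD")  # Verb Completion, Action Recognition, Video Description


def sample_task(p_tvg: float, rng: random.Random) -> str:
    """Pick TVG with probability p_tvg; otherwise pick one Invert-TVG task uniformly."""
    if rng.random() < p_tvg:
        return "TVG"
    return rng.choice(INVERT_TASKS)


def mask_verb(query: str, verb: str, mask_token: str = "[MASK]") -> str:
    """Replace the first whole-word occurrence of `verb` in `query` with a mask token."""
    words = query.split()
    for i, word in enumerate(words):
        if word.strip(".,").lower() == verb.lower():
            words[i] = mask_token
            break
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    # p_tvg = 0.8 corresponds to the 20% inversion rate mentioned in the TL;DR.
    draws = [sample_task(0.8, rng) for _ in range(10_000)]
    print({task: draws.count(task) for task in ("TVG",) + INVERT_TASKS})
    print(mask_verb("a person opens the door", "opens"))
```

With `p_tvg = 0.8`, roughly four in five training samples remain ordinary TVG samples, and the remainder is split evenly across VC, AR, and VD.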

Theorems & Definitions (5)

  • Lemma 1: Semantic Alignment Improvement
  • Proof
  • Theorem 1: Pareto Superiority
  • Proof
  • Corollary 1: Generalization Bound