TimeRefine: Temporal Grounding with Time Refining Video LLM

Xizi Wang, Feng Cheng, Ziyang Wang, Huiyu Wang, Md Mohaiminul Islam, Lorenzo Torresani, Mohit Bansal, Gedas Bertasius, David Crandall

TL;DR

TimeRefine tackles the core challenge of precise temporal grounding in Video LLMs by reframing timestamp prediction as an iterative coarse-to-fine refinement process. It combines two components: progressive time refinement through offset prediction, and an auxiliary L1-based temporal perception head that encourages predictions closer to the ground truth, both in an architecture-agnostic framework. Empirically, TimeRefine yields consistent improvements on ActivityNet Captions and Charades-STA, and also boosts performance when integrated with VTG-LLM variants, which supports its effectiveness and versatility. The work highlights the practical impact of reframing grounding objectives for better temporal localization in video-language systems, while noting the trade-off of processing more temporal tokens and suggesting future refinements to sequence design and broader QA integration.

Abstract

Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.
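The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the step layout `(start, end, start_offset, end_offset)`, the function names `final_segment` and `l1_penalty`, and the toy numbers are all assumptions made for illustration. The sketch assumes the final prediction is the last predicted segment plus its offsets, and uses a plain L1 distance to show how a farther prediction is penalized more than a nearer one.

```python
from typing import List, Tuple

# Hypothetical layout of one refinement step (not taken from the paper):
# (predicted_start, predicted_end, start_offset, end_offset), in seconds.
Step = Tuple[float, float, float, float]
Segment = Tuple[float, float]


def final_segment(steps: List[Step]) -> Segment:
    """Apply the last step's offsets to the last predicted segment."""
    if not steps:
        raise ValueError("need at least one refinement step")
    start, end, d_start, d_end = steps[-1]
    return (start + d_start, end + d_end)


def l1_penalty(pred: Segment, target: Segment) -> float:
    """L1 distance between segment boundaries; grows with the deviation."""
    return abs(pred[0] - target[0]) + abs(pred[1] - target[1])


# Toy refinement sequence for a target segment of (14.0, 24.0) seconds.
target = (14.0, 24.0)
steps = [
    (10.0, 30.0, 4.0, -6.0),  # coarse guess, large offsets
    (13.0, 25.0, 1.0, -1.0),  # closer guess, smaller offsets
    (14.0, 24.0, 0.0, 0.0),   # final guess, no correction needed
]

prediction = final_segment(steps)
penalties = [l1_penalty((s, e), target) for s, e, _, _ in steps]
print(prediction, penalties)
```

In this toy sequence the penalty shrinks at every step, which is the behavior the auxiliary head is meant to encourage. A token-level cross-entropy loss, by contrast, does not by itself distinguish a near miss from a far miss.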

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Given a text query, existing Video LLMs directly predict the start and end timestamps, which often leads to imprecise localization results. Our model generates a coarse prediction initially and then progressively refines it via temporal offset prediction for more precise temporal localization.
  • Figure 2: Overview of TimeRefine. Given a video and a textual user prompt, our model predicts an iterative time refinement sequence, i.e., an initial rough estimate of the boundaries, followed by new predictions and offsets based on its previous predictions. The new predictions and offsets help the model learn how to refine its predictions and correct its errors. Our experiments show that this iterative temporal refinement strategy can enhance the temporal perception ability of Video LLMs. We also complement the cross-entropy-based next-token prediction head with an auxiliary prediction head trained with an L1 regression loss, which encourages the model to learn that closer predictions are preferable. The final prediction is derived from the last predicted segment and its offsets.
  • Figure 3: Zero-shot case study. We compare the outputs of VTimeLLM, VTG-LLM, and TimeRefine on a video from the Charades-STA dataset. Our method iteratively refines the segment predictions. The final prediction achieves an IoU of 0.95, which is the highest among all predictions.
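
The IoU and mIoU figures quoted above can be made concrete with a short Python sketch. It assumes the standard definition of temporal IoU (overlap duration divided by union duration) and takes mIoU to be its mean over query/segment pairs; the function names `temporal_iou` and `mean_iou` are hypothetical and do not come from the paper's code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    # Degenerate zero-length intervals get an IoU of 0 by convention here.
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: List[Tuple[Segment, Segment]]) -> float:
    """Average temporal IoU over (prediction, ground truth) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(temporal_iou(p, g) for p, g in pairs) / len(pairs)


# Toy usage: one perfect prediction and one that misses entirely.
pairs = [((0.0, 10.0), (0.0, 10.0)), ((0.0, 5.0), (20.0, 30.0))]
print(mean_iou(pairs))
```

Under this definition, a prediction that overlaps the ground truth only partially is scored in proportion to the shared duration, so small boundary errors on short segments can lower IoU noticeably.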