Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, Yang Liu

TL;DR

This work tackles video temporal grounding without training data by combining large language models (LLMs) for sub-event reasoning with vision-language models (VLMs) for localization. The method decomposes a query into sub-events, uses an LLM to reason about their temporal order and relationships, and localizes each sub-event with a VLM that explicitly models both the dynamic transition into an event and the static state that follows it, through separate dynamic and static scoring functions. The per-sub-event predictions are then filtered and integrated according to the inferred order and relationships to produce the final localization. The approach achieves state-of-the-art zero-shot performance on Charades-STA and ActivityNet Captions and generalizes well in cross-dataset and out-of-distribution settings, which highlights the practical potential of training-free video grounding, although its accuracy depends on the reliability of the LLM.

Abstract

Video temporal grounding aims to identify the segments of an untrimmed video that are most relevant to a given natural language query. Existing video temporal localization models are trained on specific datasets, which incurs high data collection costs, yet they generalize poorly in cross-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the abilities of large pre-trained models. A naive baseline is to enumerate proposals in the video and use pre-trained vision-language models (VLMs) to select the best proposal according to vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, so they struggle to (1) grasp the relationships between multiple events within the same video and distinguish their temporal boundaries, and (2) comprehend and remain sensitive to the dynamic transitions of events (the transition from one event to another) in the video. To address these issues, we first leverage large language models (LLMs) to analyze the multiple sub-events contained in the query text, together with the temporal order and relationships between them. Second, we split each sub-event into a dynamic transition part and a static status part, and propose dynamic and static scoring functions that use VLMs to better evaluate the relevance between the event and its description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and use the order and relationships between sub-events provided by the LLM to filter and integrate these proposals. Without any training, our method achieves the best zero-shot video temporal grounding performance on the Charades-STA and ActivityNet Captions datasets and generalizes better in cross-dataset and OOD settings.
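
To make the proposal-scoring idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame query similarities are already available (for example from a VLM such as the BLIP-2 Q-Former), enumerates all contiguous spans, and ranks them by the sum of two hypothetical scores. `static_score` rewards spans whose frames match the query better than the rest of the video. `dynamic_score` rewards a rise in similarity at the span start. The function names, the `window` parameter, and both formulas are illustrative assumptions. The paper's actual scoring functions are not given in this excerpt.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open frame span [start, end)


def mean(values: List[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def enumerate_proposals(num_frames: int, min_len: int = 1) -> List[Span]:
    """List every contiguous span containing at least min_len frames."""
    return [
        (start, end)
        for start in range(num_frames)
        for end in range(start + min_len, num_frames + 1)
    ]


def static_score(sim: List[float], start: int, end: int) -> float:
    """Assumed form: mean similarity inside the span minus mean outside."""
    return mean(sim[start:end]) - mean(sim[:start] + sim[end:])


def dynamic_score(sim: List[float], start: int, window: int = 1) -> float:
    """Assumed form: rise in similarity across the span's start boundary."""
    before = sim[max(0, start - window):start]
    after = sim[start:start + window]
    if not before:  # a span starting at frame 0 has no preceding frames
        return 0.0
    return mean(after) - mean(before)


def top_k_proposals(sim: List[float], k: int = 3) -> List[Span]:
    """Rank all spans by static + dynamic score and keep the best k."""
    proposals = enumerate_proposals(len(sim))
    ranked = sorted(
        proposals,
        key=lambda p: static_score(sim, *p) + dynamic_score(sim, p[0]),
        reverse=True,
    )
    return ranked[:k]


# Toy per-frame similarities: the event occupies frames 2 to 4.
similarities = [0.1, 0.1, 0.8, 0.9, 0.8, 0.2]
best = top_k_proposals(similarities, k=1)[0]
print(best)  # (2, 5)
```

Exhaustive enumeration is quadratic in the number of frames, which is acceptable for sparsely sampled frames but would need pruning for long, densely sampled videos.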

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: (a) Evaluation results of existing methods and our method under the IID and OOD settings on the Charades-STA dataset. (b) Evaluation results of the naive baseline on the ActivityNet Captions dataset when the query describes a single event or multiple events. (c) The query-frame similarity obtained from the BLIP-2 Q-Former. The naive baseline based on BLIP-2 tends to predict the static parts of the video and ignores the dynamic transitions.
  • Figure 2: The pipeline of the proposed method. First, LLM prompting uses a large language model (LLM) to analyze the sub-events contained in the query and to reason about their order and temporal relationships. Then, the VLM localizer uses a vision-language model to localize each sub-event in the video: it calculates the similarity between the video frames and the sub-event description, enumerates event proposals, and explicitly considers both the dynamic transition and the static status after the transition when measuring the similarity between a proposal and the text query, thereby selecting proposals as the localization results. Finally, the results of the VLM localizer are filtered and integrated based on the order and relationships of the sub-events inferred by the LLM to make the final prediction.
  • Figure 3: Qualitative results on the ActivityNet Captions dataset.
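
The final filter-and-integrate step of the pipeline in Figure 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure. It handles only one hypothetical relationship, "sequential" (each sub-event ends before the next one begins), takes the top-k proposals per sub-event as `(start_sec, end_sec, score)` tuples, and merges the best order-consistent combination into a single span. The function name, the tuple layout, and the rule of summing scores are all assumptions.

```python
from itertools import product
from typing import List, Optional, Tuple

Proposal = Tuple[float, float, float]  # (start_sec, end_sec, score)


def integrate_sequential(
    candidates: List[List[Proposal]],
) -> Optional[Tuple[float, float]]:
    """Choose one proposal per sub-event so the sub-events occur in order.

    candidates[i] holds the top-k proposals for the i-th sub-event, listed
    in the order inferred by the LLM. Returns the merged (start, end) span
    of the highest-scoring consistent combination, or None if no
    combination respects the order.
    """
    best_total = float("-inf")
    best_span: Optional[Tuple[float, float]] = None
    for combo in product(*candidates):
        # Filter: every sub-event must end before the next one starts.
        in_order = all(prev[1] <= nxt[0] for prev, nxt in zip(combo, combo[1:]))
        if not in_order:
            continue
        total = sum(p[2] for p in combo)
        if total > best_total:
            best_total = total
            # Integrate: span from the first start to the last end.
            best_span = (combo[0][0], combo[-1][1])
    return best_span


# Two sub-events with two candidate proposals each (toy values).
top_k = [
    [(10.0, 15.0, 0.9), (2.0, 6.0, 0.7)],   # sub-event 1
    [(7.0, 12.0, 0.8), (16.0, 20.0, 0.5)],  # sub-event 2
]
print(integrate_sequential(top_k))  # (2.0, 12.0)
```

In this toy case the highest-scoring proposal for the first sub-event (10 s to 15 s) is not part of the chosen combination, because pairing it with the only later proposal for the second sub-event gives a lower total (1.4) than the pair spanning 2 s to 12 s (1.5). This illustrates how the inferred order can override a per-sub-event ranking.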