ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding

Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao

TL;DR

ActPrompt adapts pre-trained vision-language models (VLMs) to action-rich video data for video temporal grounding. It pairs an efficient in-domain fine-tuning strategy with Action-Cue-Injected Temporal Prompt Learning, which has two components: Action Cue Injection (ACI), which injects video- and verb-guided prompts into the VLM's image encoder as action cues, and Context-aware Temporal Prompt Learning (CTPL), which aggregates action-sensitive patches from neighboring frames to model motion. Two pretext tasks, moment-query pairwise ranking and moment-query contrastive learning, drive the in-domain adaptation. On moment retrieval and highlight detection across QVHighlights, Charades-STA, and TACoS, ActPrompt consistently improves strong baselines; the authors provide the complete code in the supplementary materials.

Abstract

Video temporal grounding is an emerging topic that aims to identify specific clips within videos. In addition to pre-trained video models, contemporary methods use pre-trained vision-language models (VLMs) to capture detailed characteristics of diverse scenes and objects from video frames. However, because they are pre-trained on images, VLMs may struggle to distinguish action-sensitive patterns from static objects, so they must be adapted to specific data domains to yield effective feature representations for temporal grounding. We address two primary challenges to achieve this goal. First, to mitigate high adaptation costs, we propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation, in which downstream-adaptive features are learned through several pretext tasks. Second, to integrate action-sensitive information into VLMs, we introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the VLM's image encoder to better discover action-sensitive patterns. Extensive experiments demonstrate that ActPrompt is an off-the-shelf training framework that can be effectively applied to various SOTA methods, resulting in notable improvements. The complete code used in this study is provided in the supplementary materials.
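
The two pretext tasks named in the TL;DR, moment-query pairwise ranking and moment-query contrastive learning, are not given as formulas on this page. As a rough illustration only, the sketch below uses two generic textbook losses as stand-ins: a hinge-style ranking loss and an InfoNCE-style contrastive loss over cosine similarities. The function names, the `margin` and `temperature` values, and the choice of cosine similarity are all assumptions for illustration and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity between two plain-list vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def pairwise_ranking_loss(query, pos_moment, neg_moment, margin=0.2):
    # Hinge loss (assumed form): the matching moment should outscore
    # the non-matching moment by at least `margin`.
    return max(0.0, margin - cosine(query, pos_moment) + cosine(query, neg_moment))


def contrastive_loss(queries, moments, temperature=0.07):
    # InfoNCE-style loss (assumed form): queries[i] matches moments[i];
    # every other moment in the batch acts as a negative.
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, m) / temperature for m in moments]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)
```

Both functions take already-extracted feature vectors; in the setting the abstract describes, such losses would be used to update only the fine-tuned modules while the rest of the VLM stays frozen.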

Paper Structure (24 sections, 5 equations, 5 figures, 4 tables)

This paper contains 24 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of feature extraction pipelines. The image encoder in a VLM can capture more detailed features of static objects, but may not be able to distinguish action-sensitive objects from backgrounds. Action cues from other modalities are needed to guide the image encoder in recognizing action-sensitive objects, so that its capacity is fully used.
  • Figure 2: The overall framework of ActPrompt. To capture visual regions related to motion, Action Cue Injection (ACI) injects video- and verb-guided prompts from other encoders into the VLM's image encoder as action cues. Context-aware Temporal Prompt Learning (CTPL) selects action-sensitive visual regions from consecutive frames via ACI and groups them to generate a temporal prompt. The output representations are fed into the training objectives of the pretext tasks, moment-query pairwise ranking and moment-query contrastive learning, for adaptation to downstream grounding. The fine-tuned modules are marked with a red flame icon; all other modules are frozen.
  • Figure 3: Illustration of context-aware temporal prompt learning. For each frame, we sample the patch with the highest attention score to the action-sensitive prompt (left), then concatenate the sampled patch embeddings from the current frame and its neighboring frames to learn the context-aware temporal prompt (right).
  • Figure 4: Visualization of joint moment retrieval and highlight detection on QVHighlights over various baselines and their variants with our ActPrompt.
  • Figure 5: Visualization of attention maps from frozen CLIP's image encoder and in-domain fine-tuned encoder.
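
The patch selection described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_patch` and `temporal_context`, the `window` parameter, and the clipping of the neighborhood at the start and end of the video are all hypothetical. The learned layer that would turn the concatenated vector into a temporal prompt is not shown.

```python
def select_patch(attn_scores, patch_embeddings):
    # Return the embedding of the patch with the highest attention score
    # to the action-sensitive prompt within a single frame.
    best = max(range(len(attn_scores)), key=lambda i: attn_scores[i])
    return patch_embeddings[best]


def temporal_context(frames_attn, frames_patches, t, window=1):
    # Concatenate the selected patch embeddings of frame t and of its
    # neighbours within `window` frames. Clipping at the video boundaries
    # is an assumption made for this sketch.
    lo = max(0, t - window)
    hi = min(len(frames_attn), t + window + 1)
    context = []
    for k in range(lo, hi):
        context.extend(select_patch(frames_attn[k], frames_patches[k]))
    return context


# Toy example: 3 frames, 2 patches per frame, 2-dimensional embeddings.
frames_attn = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
frames_patches = [
    [[1, 1], [2, 2]],
    [[3, 3], [4, 4]],
    [[5, 5], [6, 6]],
]
middle_context = temporal_context(frames_attn, frames_patches, t=1)
```

Here `middle_context` joins the top-scoring patch of frames 0, 1, and 2 into one vector, which is the input the caption describes for learning the context-aware temporal prompt.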