DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang

TL;DR

DGL is a parameter-efficient approach to text-video retrieval (TVR) that generates dynamic cross-modal prompts from a shared latent space and applies a global-local video attention mechanism. By freezing the CLIP backbone and updating only lightweight prompts and a local-prompt generator, DGL achieves strong retrieval performance on MSR-VTT, VATEX, LSMDC, and ActivityNet with far fewer trainable parameters (as low as ~0.83M), and its results are competitive with or superior to full fine-tuning and other PEFL methods. The method offers two local-prompt generation strategies (Unified Prompt Transformer and Unified Linear Projection) and pairs them with a global-local attention scheme that captures both frame-level details and global temporal dynamics, enabling efficient cross-modal alignment. The work shows that cross-modal prompt tuning is practical for scalable TVR deployment: it reduces storage and computation while maintaining high accuracy.

Abstract

Text-video retrieval is a critical multi-modal task: finding the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the cost of fully fine-tuning these models keeps rising as model size increases. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully fine-tuned methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL
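As a rough illustration of generating text and frame prompts from one shared latent space, the sketch below projects a single set of shared latent vectors through two separate linear maps, one per modality. It is only loosely in the spirit of the Unified Linear Projection variant named above and is not the authors' implementation; the function names, toy dimensions, and random initialization are all assumptions made for illustration.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random weight matrix of shape (out_dim, in_dim) as nested lists (hypothetical init)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vectors):
    """Apply the linear map to every vector; returns one output vector per input vector."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights] for vec in vectors]

def generate_local_prompts(shared_latents, w_text, w_frame):
    """Derive text prompts and frame prompts from the SAME latent vectors.

    Because both prompt sets depend on shared_latents, a gradient update to the
    latents would change both modalities at once (the coupling assumed here).
    """
    text_prompts = project(w_text, shared_latents)
    frame_prompts = project(w_frame, shared_latents)
    return text_prompts, frame_prompts

# Toy sizes, chosen only for illustration (not the dimensions used in the paper).
NUM_PROMPTS, LATENT_DIM, TEXT_DIM, VISUAL_DIM = 4, 8, 6, 10

rng = random.Random(0)
shared_latents = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
                  for _ in range(NUM_PROMPTS)]
w_text = make_linear(LATENT_DIM, TEXT_DIM, rng)
w_frame = make_linear(LATENT_DIM, VISUAL_DIM, rng)

text_prompts, frame_prompts = generate_local_prompts(shared_latents, w_text, w_frame)
```

In a real setup, only the shared latents and the two projections would be trainable while the CLIP encoders stay frozen; that training loop is omitted here.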

Paper Structure (14 sections, 10 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: (a) Frame attention methods such as CLIP4Clip often emphasize non-semantic corners and miss the protagonist's action, which motivated the design of global-local video attention for capturing global-level cross-frame dynamics. (b) Performance comparison on MSR-VTT: DGL outperforms six PEFL methods and fully fine-tuned CLIP4Clip while updating only a small number of parameters.
  • Figure 2: Overview of our Dynamic Global-Local prompt tuning framework. DGL consists of the Local Prompt Generation module, the Text Encoder, and the Visual Encoder. During downstream training, all encoders are frozen, and only the parameters marked with a fire icon are trainable. Local Prompt Generation ensures cross-modal interaction at the word-frame level, and Global-Local Video Attention guides the visual encoder to extract general video information from different perspectives.
  • Figure 3: Illustration of Global-Local Video Attention. The patches in the green mask serve as the queries in self-attention, and the patches in the orange mask serve as the keys and values. In local frame attention, frame prompts serve as the queries to capture fine-grained local information in each frame; in global video attention, the global prompt acts as the query to extract global-level video information from all frames.
  • Figure 4: Visualization of text-video retrieval results. Frames in the green box are DGL's R@1 results, while those in the red box are CLIP4Clip's R@1 results.
  • Figure 5: Visualization of the global prompt. We plot the global prompt's attention weight on each frame. The red text in the query corresponds to the video's discriminative features.
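
The query/key split described for Figure 3 can be sketched as a boolean attention mask. The following is a minimal sketch under stated assumptions, not the paper's implementation: the token layout (one global prompt at index 0, then each frame's prompts followed by its patches), the rule that tokens inside a frame attend only within that frame, and the function name `build_global_local_mask` are all hypothetical.

```python
def build_global_local_mask(num_frames, prompts_per_frame, patches_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout: index 0 is the global prompt; each frame then contributes
    prompts_per_frame frame-prompt tokens followed by patches_per_frame patches.
    """
    tokens_per_frame = prompts_per_frame + patches_per_frame
    total = 1 + num_frames * tokens_per_frame
    mask = [[False] * total for _ in range(total)]

    # Global video attention: the global prompt queries every token of every frame.
    for k in range(total):
        mask[0][k] = True

    # Local frame attention: tokens of a frame (prompts and patches) see only
    # that frame. Restricting patch queries this way is an assumption here.
    for f in range(num_frames):
        start = 1 + f * tokens_per_frame
        end = start + tokens_per_frame
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask

# Toy configuration: 3 frames, 2 frame prompts and 4 patches per frame.
mask = build_global_local_mask(num_frames=3, prompts_per_frame=2, patches_per_frame=4)
```

A mask like this would typically be converted to additive form (0 where allowed, a large negative value where blocked) before the softmax in self-attention; that step depends on the attention implementation and is left out.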