DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
TL;DR
DGL is a parameter-efficient approach to text-video retrieval (TVR) that generates dynamic cross-modal prompts from a shared latent space and applies a global-local video attention mechanism. The CLIP backbone stays frozen; only lightweight prompts and a local-prompt generator are updated. With as few as ~0.83M trainable parameters, DGL achieves results that are competitive with or superior to fully finetuned models and other parameter-efficient finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet. The method combines two local-prompt generation strategies (Unified Prompt Transformer and Unified Linear Projection) with a global-local attention scheme that captures both frame-level details and global temporal dynamics, enabling efficient cross-modal alignment. This work demonstrates the practical value of cross-modal prompt tuning for scalable TVR deployment: it reduces storage and computation while maintaining high accuracy.
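To make the idea of generating prompts for both modalities from one shared latent space more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the dimensions, the random initialisation, and the helper names (`init_linear`, `project`, `make_prompts`) are all hypothetical, and a simple linear projection stands in for whatever generator the method actually uses. The point it illustrates is that the text prompts and frame prompts are both functions of the same learnable latent vectors, so a change to the latent affects both modalities.

```python
import random

def init_linear(in_dim, out_dim, rng):
    """Return a randomly initialised (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weight, vectors):
    """Apply the linear map `weight` to every vector in `vectors`."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]

def make_prompts(latent, w_text, w_frame):
    """Derive text prompts and frame prompts from the SAME shared latent vectors."""
    return project(w_text, latent), project(w_frame, latent)

# Hypothetical sizes, chosen only to keep the example small.
NUM_PROMPTS, LATENT_DIM, TEXT_DIM, FRAME_DIM = 4, 8, 6, 10

rng = random.Random(0)
# Shared latent prompts: in a real system these would be trainable parameters.
latent = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(NUM_PROMPTS)]
w_text = init_linear(LATENT_DIM, TEXT_DIM, rng)    # latent -> text-encoder width
w_frame = init_linear(LATENT_DIM, FRAME_DIM, rng)  # latent -> visual-encoder width

text_prompts, frame_prompts = make_prompts(latent, w_text, w_frame)

# Perturbing the shared latent changes the prompts of both modalities.
perturbed = [[x + 1.0 for x in v] for v in latent]
text_prompts_2, frame_prompts_2 = make_prompts(perturbed, w_text, w_frame)
```

In this sketch only `latent`, `w_text`, and `w_frame` would be trained; the encoders that consume the prompts would remain frozen, which is where the small trainable-parameter count comes from.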
Abstract
Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the cost of fully finetuning these models keeps rising as model size increases. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to full finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL
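The abstract does not spell out how the global-local attention is wired, so the following is only one plausible reading, sketched as an attention mask in plain Python. The assumptions are: tokens are laid out as a few global prompt tokens followed by the tokens of each frame; global prompt tokens may attend to every token (capturing video-level information), while each frame token attends only to tokens of its own frame (preserving frame-level detail). The function name `build_global_local_mask` and all sizes are hypothetical.

```python
def build_global_local_mask(num_frames, tokens_per_frame, num_global):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Assumed token layout: [global prompts] + [frame 0 tokens] + [frame 1 tokens] + ...
    """
    total = num_global + num_frames * tokens_per_frame

    def frame_of(index):
        # Global prompt tokens belong to no frame.
        if index < num_global:
            return None
        return (index - num_global) // tokens_per_frame

    allowed = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if frame_of(q) is None:
                allowed[q][k] = True  # assumption: global prompts see the whole video
            else:
                allowed[q][k] = frame_of(k) == frame_of(q)  # local: same frame only
    return allowed

# Tiny hypothetical example: 1 global prompt, 3 frames, 2 tokens per frame.
mask = build_global_local_mask(num_frames=3, tokens_per_frame=2, num_global=1)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

A mask of this shape would typically be applied to the attention scores of a frozen transformer layer before the softmax; whether frame tokens should also see the global prompts is a design choice that this sketch leaves out.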
