TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding
TL;DR
This work tackles the practical bottlenecks of text-video retrieval (TVR) by freezing a CLIP backbone and applying a parameter-efficient token-merging strategy to curb temporal redundancy. The proposed Temporal Token Merging (TempMe) implements a Progressive Multi-Granularity framework with ImgMe (intra-frame) and ClipMe (cross- and intra-clip) blocks that progressively reduce tokens while learning unified spatio-temporal video representations. Across MSRVTT, ActivityNet, DiDeMo, and LSMDC, TempMe achieves state-of-the-art efficiency-accuracy trade-offs, reducing output tokens by up to 95% and GFLOPs by around 51%, while delivering notable R-Sum gains (e.g., ~4.4% in t2v) and speedups (up to ~1.8×). The method also generalizes to full fine-tuning and integrates with video foundation-model setups. Overall, TempMe provides a scalable, versatile way to leverage image-language pretraining for video tasks with reduced computation and memory footprints, enabling faster and more accessible TVR in real-world settings.
Abstract
Most text-video retrieval methods use text-image pre-trained models such as CLIP as a backbone. These methods process each sampled frame independently with the image encoder, resulting in high computational overhead and limiting practical deployment. To address this, we focus on efficient text-video retrieval by tackling two key challenges: (1) from the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; (2) from the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training- and inference-efficient text-video retrieval architecture that minimizes trainable parameters and model complexity. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we reduce spatio-temporal redundancy and enhance temporal modeling across different frames, leading to improved efficiency and performance. Extensive experiments validate the superiority of TempMe. Compared to previous parameter-efficient text-video retrieval methods, TempMe achieves superior performance with just 0.50M trainable parameters. It reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8× speedup and a 4.4% R-Sum improvement. With full fine-tuning, TempMe achieves a 7.9% R-Sum improvement, trains 1.57× faster, and uses 75.2% of the GPU memory. The code is available at https://github.com/LunarShen/TempMe.
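To make the idea of combining neighboring clips concrete, the following is a minimal sketch of similarity-based token merging between two adjacent clips. It is not the authors' implementation: the function name `merge_neighboring_clips`, the parameter `r`, the cosine-similarity matching rule, and the averaging step are all assumptions chosen for illustration, loosely in the style of bipartite token-merging methods. The actual ImgMe and ClipMe blocks are in the linked repository.

```python
import numpy as np


def merge_neighboring_clips(clip_a, clip_b, r):
    """Illustrative (hypothetical) merge of two neighboring clips.

    clip_a: (Na, D) token features of the first clip.
    clip_b: (Nb, D) token features of the second clip.
    r:      number of tokens from clip_a to merge into clip_b.

    Returns an array of shape (Na + Nb - r, D).
    """
    if not 0 <= r <= clip_a.shape[0]:
        raise ValueError("r must be between 0 and the number of tokens in clip_a")

    # Cosine similarity between every token in clip_a and every token in clip_b.
    eps = 1e-12
    a_norm = clip_a / np.maximum(np.linalg.norm(clip_a, axis=1, keepdims=True), eps)
    b_norm = clip_b / np.maximum(np.linalg.norm(clip_b, axis=1, keepdims=True), eps)
    sim = a_norm @ b_norm.T  # shape (Na, Nb)

    # For each token in clip_a, find its most similar token in clip_b.
    best_match = sim.argmax(axis=1)
    best_score = sim.max(axis=1)

    # The r most redundant clip_a tokens (highest similarity) get merged.
    order = np.argsort(-best_score)
    merged_idx = order[:r]
    kept_idx = np.sort(order[r:])  # keep the remaining tokens in original order

    # Average each merged clip_a token into its matched clip_b token.
    sums = clip_b.astype(float)  # astype returns a copy, so clip_b is untouched
    counts = np.ones(clip_b.shape[0])
    np.add.at(sums, best_match[merged_idx], clip_a[merged_idx])
    np.add.at(counts, best_match[merged_idx], 1.0)
    merged_b = sums / counts[:, None]

    return np.concatenate([clip_a[kept_idx], merged_b], axis=0)


# Usage: two frames of 50 tokens each (assumed sizes), merging 40 tokens.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(50, 8))
frame_b = rng.normal(size=(50, 8))
merged = merge_neighboring_clips(frame_a, frame_b, r=40)
print(merged.shape)  # (60, 8)
```

Applying such a step repeatedly, first within frames and then across progressively longer clips, is one way to picture how the token count can shrink layer by layer while tokens from different frames are fused into a shared representation.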
