TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, Guiguang Ding

TL;DR

This work tackles the practical bottlenecks of text-video retrieval (TVR) by freezing a CLIP backbone and employing a parameter-efficient token-merging strategy to curb temporal redundancy. The proposed Temporal Token Merging (TempMe) implements a Progressive Multi-Granularity framework with ImgMe (intra-frame) and ClipMe (cross- and intra-clip) blocks to progressively reduce tokens while learning unified spatio-temporal video representations. Across MSRVTT, ActivityNet, DiDeMo, and LSMDC, TempMe achieves state-of-the-art efficiency-accuracy trade-offs, reducing output tokens by up to 95% and GFLOPs by around 51%, while delivering notable R-Sum gains (e.g., ~4.4% in t2v) and speedups (up to ~1.8x). The method generalizes to full fine-tuning and integrates with video foundation-model setups, demonstrating broad applicability and practical potential for deployment in efficient text-video retrieval systems. Overall, TempMe provides a scalable, versatile approach to leveraging image-language pretraining for video tasks with reduced computation and memory footprints, enabling faster and more accessible TVR in real-world settings.

Abstract

Most text-video retrieval methods use text-image pre-trained models such as CLIP as a backbone. These methods process each sampled frame independently with the image encoder, resulting in high computational overhead and limiting practical deployment. To address this, we focus on efficient text-video retrieval by tackling two key challenges: (1) from the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; (2) from the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training-inference-efficient text-video retrieval architecture that minimizes trainable parameters and model complexity. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we reduce spatio-temporal redundancy and enhance temporal modeling across different frames, leading to improved efficiency and performance. Extensive experiments validate the superiority of TempMe. Compared to previous parameter-efficient text-video retrieval methods, TempMe achieves superior performance with just 0.50M trainable parameters. It reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8× speedup and a 4.4% R-Sum improvement. With full fine-tuning, TempMe achieves a 7.9% R-Sum improvement, trains 1.57× faster, and uses 75.2% of the GPU memory. The code is available at https://github.com/LunarShen/TempMe.

Paper Structure

This paper contains 48 sections, 2 equations, 7 figures, and 17 tables.

Figures (7)

  • Figure 1: (a) An example illustrates the large temporal redundancy between adjacent frames. Identical subjects are highlighted in the same color. (b) Current methods treat video input as a sequence of multiple sampled frames, causing high complexity due to the large number of tokens. (c) In contrast, our TempMe reduces temporal redundancy by progressively merging redundant tokens in adjacent video clips. (d) With CLIP-ViT-B/16 on MSRVTT, our TempMe reaches state-of-the-art performance with minimal computational overhead. R-Sum is the sum of R@1, R@5, and R@10.
  • Figure 2: Overview of our proposed TempMe. We introduce a Progressive Multi-Granularity (PMG) framework consisting of both image merging and clip merging stages. In the image merging stage, ImgMe Block merges redundant spatial tokens within a single frame. Following this, ClipMe Block progressively forms new clips from adjacent ones, facilitating video-level feature learning and reducing temporal redundancy by merging tokens across different frames.
  • Figure 3: ClipMe Block. Given an input of $f$ clips in $\mathbb{R}^{f \times N \times D}$, the cross-clip merging step merges tokens from all clips to form a new clip in $\mathbb{R}^{1 \times fNR_c \times D}$. Subsequently, the intra-clip merging step merges tokens within this newly formed clip, producing an output in $\mathbb{R}^{1 \times fNR_cR_I \times D}$. If the input contains only one clip, the cross-clip merging step is skipped.
  • Figure 4: Qualitative comparisons on MSRVTT with CLIP-ViT-B/16. Patches that share the same inner and border color are merged. TempMe merges tokens of similar elements across frames.
  • Figure 5: Hyper-parameter analysis for text-to-video results on MSRVTT with CLIP-ViT-B/32. Each hyper-parameter is evaluated while keeping all other hyper-parameters fixed.
  • ...and 2 more figures
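
The shape bookkeeping in the Figure 3 caption can be illustrated with a small sketch. It rests on assumptions not stated in the text above: $R_c$ and $R_I$ are read as the fraction of tokens kept by each step, token counts are rounded to the nearest integer, and the merge itself is a greedy "average the most similar pair" stand-in rather than the merging algorithm used by TempMe. The names `cosine`, `merge_to`, and `clipme_block` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two token vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_to(tokens, target):
    """Greedy stand-in merge (an assumption, not the paper's algorithm):
    average the most similar pair until `target` tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target, 1):
        i, j = max(
            combinations(range(len(tokens)), 2),
            key=lambda p: cosine(tokens[p[0]], tokens[p[1]]),
        )
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return tokens


def clipme_block(clips, r_c, r_i):
    """Input: f clips, each a list of N tokens of dimension D.
    Output: one clip with about f * N * r_c * r_i tokens."""
    f, n = len(clips), len(clips[0])
    if f > 1:
        # Cross-clip merging: pool all f * N tokens, keep about f * N * r_c.
        pooled = [tok for clip in clips for tok in clip]
        pooled = merge_to(pooled, round(f * n * r_c))
    else:
        # A single input clip skips the cross-clip step.
        pooled = clips[0]
    # Intra-clip merging within the newly formed clip.
    return merge_to(pooled, round(len(pooled) * r_i))


# Two clips (f = 2) of four 2-dimensional tokens each (N = 4, D = 2).
clips = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0], [0.2, 0.8]],
]
out = clipme_block(clips, r_c=0.5, r_i=0.5)
print(len(out), len(out[0]))  # 2 2
```

With $f = 2$, $N = 4$, and both ratios set to 0.5, the 8 pooled tokens shrink to 4 after the cross-clip step and to 2 after the intra-clip step, matching the $fNR_cR_I$ count in the caption.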