RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li

TL;DR

RAP tackles the high cost of adapting CLIP to text-video retrieval by freezing the image-text backbone and introducing two targeted adapter modules. Low-Rank Modulation (LoRM) imposes temporal sparsity on frame features through a compact, low-rank factorization, while Asynchronous Self-Attention (ASA) builds cross-frame temporal correlations by selectively warping a subset of patches with learnable offsets. A text-conditioned patch-selection mechanism further focuses computation on the most relevant regions, enabling efficient and effective cross-modal alignment. Experiments across four TVR datasets show that RAP matches or surpasses fully fine-tuned baselines and other parameter-efficient methods with orders of magnitude fewer trainable parameters, and that a lighter variant, RAP_light, reduces cost even further.
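
To make the "compact, low-rank factorization" concrete, the following worked count shows how factorizing a per-frame modulation matrix shrinks the number of trainable values. The symbols $T$ (frames), $D$ (feature width), $r$ (rank), the factor names $U$ and $V$, and the numeric values are all illustrative assumptions, not figures from the paper.

$$
\begin{aligned}
W \in \mathbb{R}^{T \times D} &\approx U V, \qquad U \in \mathbb{R}^{T \times r},\quad V \in \mathbb{R}^{r \times D} \\
\text{params}(W) &= T D, \qquad \text{params}(U, V) = r\,(T + D) \\
T = 12,\ D = 512,\ r = 4: \quad T D &= 6144, \qquad r\,(T + D) = 4 \cdot 524 = 2096
\end{aligned}
$$

Because every row of $UV$ is a combination of only $r$ shared rows of $V$, the $T$ per-frame modulation vectors cannot vary independently, which is one way to read the temporal constraint the summary describes.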

Abstract

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods perform image-to-video transfer learning on top of large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with only a few parameterized layers. To accommodate the text-video scenario, we equip RAP with two indispensable characteristics: temporal sparsity and temporal correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. In addition, we introduce an asynchronous self-attention mechanism that first selects the most responsive visual patches and then augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves superior or comparable performance relative to its fully fine-tuned counterpart and to other parameter-efficient fine-tuning methods.
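
The low-rank modulation idea in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the modulation takes the form x' = s * x + c applied elementwise to per-frame features, with the scale s and shift c each built as the product of two small factor matrices. The function names `matmul` and `low_rank_modulate` and the toy values are hypothetical.

```python
def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def low_rank_modulate(frames, scale_factors, shift_factors):
    """Apply x' = s * x + c elementwise, with s and c given as low-rank products.

    frames: T x D per-frame features (stand-in for frozen backbone outputs).
    scale_factors, shift_factors: pairs (A, B), A of shape T x r, B of shape r x D.
    Assumption: this affine form is illustrative; the paper's exact rule may differ.
    """
    scale = matmul(*scale_factors)  # T x D matrix of rank at most r
    shift = matmul(*shift_factors)  # T x D matrix of rank at most r
    return [
        [s * x + c for x, s, c in zip(f_row, s_row, c_row)]
        for f_row, s_row, c_row in zip(frames, scale, shift)
    ]


# Toy usage: T=2 frames, D=2 feature dims, rank r=1.
frames = [[1.0, 2.0], [3.0, 4.0]]
scale = ([[1.0], [2.0]], [[1.0, 0.5]])
shift = ([[0.0], [1.0]], [[1.0, 1.0]])
out = low_rank_modulate(frames, scale, shift)
print(out)  # [[1.0, 1.0], [7.0, 5.0]]
```

In this toy case only the small factor matrices would be trained, while `frames` stays fixed, mirroring the frozen-backbone setup described above.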

Paper Structure (12 sections, 8 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Top: Illustration of temporal sparsity. We visualize the modulation weights with and without low-rank decomposition. Bottom: Illustration of temporal correlation. The query patch is marked by the yellow cross, and the similarity maps within the other frames are plotted.
  • Figure 2: Text-to-video retrieval performance on MSR-VTT dataset. Marker sizes are proportional to the number of tunable parameters.
  • Figure 3: An overview of RAP. (a) LoRM sets up learnable shift parameters $\boldsymbol{c}_\text{v}$ and scale parameters $\boldsymbol{s}_\text{v}$ to calibrate the vanilla CLIP features. To meet the temporal sparsity requirement, $\boldsymbol{c}_\text{v}$ and $\boldsymbol{s}_\text{v}$ are generated by low-rank decomposition along the temporal dimension. (b) Asynchronous self-attention first selects the patch set $\mathcal{S}_t$ via text-conditioned selection. The selected patches are then warped based on the learnable patch offset $\gamma$ and temporal offset $\delta$.
  • Figure 4: Illustration of temporal sparsity. We visualize the modulation weights with and without low-rank decomposition.
  • Figure 5: Illustration of temporal correlation. The query patch is marked by the yellow cross, and the similarity maps within the other frames are plotted.
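
The two steps named in the Figure 3(b) caption, text-conditioned selection followed by offset-based warping, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's mechanism: patches are scored by a plain dot product with a text embedding, and the learnable offsets $\gamma$ and $\delta$ are replaced by fixed integer shifts clamped to the valid index range. The names `select_patches` and `warp_index` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_patches(patches, text, k):
    """Return the indices of the k patches in one frame that score highest
    against the text embedding (top-k by dot product), in ascending order."""
    scores = [dot(p, text) for p in patches]
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def warp_index(t, p, delta, gamma, num_frames, num_patches):
    """Shift a (frame, patch) index by a temporal offset delta and a patch
    offset gamma, clamping to valid ranges.

    Assumption: integer offsets and clamping are stand-ins chosen for this
    sketch; in the paper the offsets are learnable.
    """
    t_shifted = min(max(t + delta, 0), num_frames - 1)
    p_shifted = min(max(p + gamma, 0), num_patches - 1)
    return t_shifted, p_shifted


# Toy usage: one frame with 4 patches of dimension 2.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text = [1.0, 0.0]
selected = select_patches(patches, text, k=2)
print(selected)  # [0, 2]

# Each selected patch in frame 1 attends to a shifted location in another frame.
targets = [warp_index(1, p, delta=1, gamma=1, num_frames=4, num_patches=4) for p in selected]
print(targets)  # [(2, 1), (2, 3)]
```

Restricting the warping step to the selected indices is what keeps the cross-frame computation small in this sketch: only k patches per frame take part instead of all of them.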