Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian, Yanhua Cheng, Yi Liu, Xinglin Hou, Quan Chen, Han Li

TL;DR

This work tackles the efficiency–effectiveness trade-off in text-to-video retrieval by introducing EERCF, a coarse-to-fine visual representation framework built on CLIP. A key novelty is the Text-Gated Interaction Block (TIB), which yields fine-grained, text-conditioned video features without adding any learnable parameters; it is coupled with a Pearson Constraint that reduces intra-feature redundancy. Retrieval follows a two-stage recall–rerank pipeline: fast recall using text-agnostic coarse representations, then text-driven re-ranking using frame- and patch-level features, which preserves accuracy while sharply reducing FLOPs. Across MSRVTT, VATEX, MSVD, and ActivityNet, EERCF achieves competitive or superior performance while being up to about 50x more efficient, which makes it practical for scalable multimodal retrieval systems.
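To make the idea of reducing intra-feature redundancy concrete, here is a minimal sketch of a Pearson-style penalty. This summary does not give the paper's exact formulation, so this is an assumption: the penalty is taken to be the mean squared Pearson correlation between distinct feature channels over a batch. The names `pearson` and `redundancy_penalty` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant channel carries no correlation signal
    return cov / (std_x * std_y)


def redundancy_penalty(features):
    """Mean squared Pearson correlation over all pairs of distinct channels.

    `features` is a batch: a list of samples, each a list of channel values.
    ASSUMPTION: this is an illustrative stand-in, not the paper's loss.
    """
    dim = len(features[0])
    channels = [[sample[d] for sample in features] for d in range(dim)]
    pairs = [(i, j) for i in range(dim) for j in range(i + 1, dim)]
    if not pairs:
        return 0.0
    total = sum(pearson(channels[i], channels[j]) ** 2 for i, j in pairs)
    return total / len(pairs)
```

Under this assumption, a batch whose channels are linear copies of one another receives the maximum penalty, while a batch with uncorrelated channels receives none, so minimizing the term pushes channels to carry different information.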

Abstract

In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite its prohibitive computational complexity. As a result, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, so that during the training phase the model captures visual content features spanning from abstract to detailed levels. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content. Moreover, it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the method's efficiency and effectiveness. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
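The recall-then-rerank idea described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It assumes that the coarse feature is a mean over frame features, and that text gating is a softmax over text-frame similarities with a hypothetical temperature `tau`. It uses frame-level features only, whereas the paper also uses patch-level features. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def mean_pool(frames):
    """Text-agnostic coarse video feature: average of the frame features."""
    dim = len(frames[0])
    return normalize([sum(f[d] for f in frames) / len(frames) for d in range(dim)])


def text_gated_pool(text, frames, tau=0.1):
    """Parameter-free gating: softmax over text-frame similarity weights the frames."""
    frames = [normalize(f) for f in frames]
    sims = [dot(text, f) / tau for f in frames]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]  # subtract max for numerical stability
    weights = [e / sum(exps) for e in exps]
    dim = len(frames[0])
    return normalize([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])


def recall_then_rerank(text, videos, k=2, tau=0.1):
    """Return the top-k video ids, recalled coarsely and reranked finely."""
    text = normalize(text)
    # Stage 1: coarse features do not depend on the query, so in practice
    # they could be precomputed offline; they are computed inline here.
    coarse = {vid: mean_pool(frames) for vid, frames in videos.items()}
    recalled = sorted(videos, key=lambda vid: dot(text, coarse[vid]), reverse=True)[:k]
    # Stage 2: the costlier text-conditioned features are built for k videos only.
    fine = {vid: dot(text, text_gated_pool(text, videos[vid], tau)) for vid in recalled}
    return sorted(recalled, key=lambda vid: fine[vid], reverse=True)
```

The design point is that the query-dependent computation scales with k rather than with the size of the video collection, which is where the efficiency gain in a pipeline like this would come from.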

Paper Structure

This paper contains 23 sections, 9 equations, 6 figures, 11 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Effectiveness and efficiency of text-to-video retrieval models, evaluated on MSRVTT-1K-Test with a CLIP (ViT-B/32) backbone. The mainstream trend runs from the lower left to the upper right corner. Our method achieves the best balance and sits in the upper left corner.
  • Figure 2: Overview of the proposed EERCF framework. EERCF mainly consists of two parts: 1) Coarse-grained and fine-grained visual representations obtained from the TIB module for the recall-reranking pipeline. 2) Inter- and intra-feature supervision loss for optimizing representation learning. Best viewed in color.
  • Figure 3: Retrieval performance on MSRVTT-1K-Test for different numbers of re-ranking candidates k.
  • Figure 4: Visualization of the coarse-to-fine retrieval process on MSRVTT-1K-Test. Green boxes mark the ground-truth video corresponding to the query text, and red boxes mark confused videos. More results are provided in the supplementary material.
  • Figure 5: Visualization of good cases on MSRVTT-1K-Test. Green boxes mark the ground-truth video corresponding to the query text, and red boxes mark confused videos.
  • ...and 1 more figure