MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

TL;DR

MUSE introduces an efficient, plug-and-play multi-scale learning framework for text-video retrieval by generating a feature pyramid from the last CLIP feature map and employing a linear-time Mamba-based learner (ResMamba) to model cross-resolution correlations. The approach optimizes three components—multi-scale feature generation, scale-wise aggregation, and the gated residual Mamba block—to achieve state-of-the-art results on MSR-VTT, DiDeMo, and ActivityNet with favorable memory and compute characteristics. Through extensive ablations, the authors show that scale-wise aggregation, bidirectional scanning, and the Mamba family offer superior efficiency and accuracy compared to Transformer-based or other linear-attention baselines. The work demonstrates the practical potential of linear-time cross-resolution context modeling for TVR and provides insights into scalable multi-scale video understanding.

Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherent plain structure of CLIP, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.
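
The abstract's two-step idea (build a feature pyramid from the last single-scale feature map, then hand the scales to a sequence learner) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of average pooling, the pooling factors `(1, 2, 4)`, the coarse-to-fine ordering, and all function names are assumptions made here for clarity.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # H x W grid of C-dimensional patch features


def avg_pool(grid: Grid, k: int) -> Grid:
    """Average-pool an H x W grid of feature vectors with a k x k window and stride k."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    if h % k or w % k:
        raise ValueError("grid size must be divisible by the pooling factor")
    pooled: Grid = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            # Collect the k*k feature vectors under the window, then average per channel.
            cell = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append([sum(v[d] for v in cell) / len(cell) for d in range(c)])
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors=(1, 2, 4)) -> List[Grid]:
    """Hypothetical pyramid: one pooled copy of the last feature map per factor."""
    return [avg_pool(grid, k) for k in factors]


def scale_wise_sequence(pyramid: List[Grid]) -> List[Vector]:
    """Flatten each scale row by row and concatenate the scales, coarsest first (assumed order)."""
    ordered = sorted(pyramid, key=len)  # fewest rows = coarsest scale
    return [tok for level in ordered for row in level for tok in row]


# Toy 4 x 4 patch grid with 2-dimensional features for a single frame.
frame = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
pyramid = build_pyramid(frame)
sequence = scale_wise_sequence(pyramid)
print([len(level) * len(level[0]) for level in pyramid])  # tokens per scale
print(len(sequence))  # total tokens passed to the sequence learner
```

Because every scale is derived from the same final feature map, a sketch like this needs no change to the backbone, which is consistent with the plug-and-play framing above.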

Paper Structure (20 sections, 7 equations, 4 figures, 11 tables)

This paper contains 20 sections, 7 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: (a) Illustration of multi-scale features. Given the text query, the model without multi-scale features retrieves a relevant but incorrect video because the small but crucial object "torches" cannot be identified using only frame-level feature representations (e.g., [CLS] tokens). We visualize the token similarity between the word "torches" and our extracted multi-scale features by arranging the attention maps in a feature-pyramid style, from low to high resolution. Our model aggregates the patches of the object "torches" (marked with a green boundary) across multiple granularities to build a correlation between the word "torches" and its visual entity in the video; (b) Efficiency-performance comparison. The horizontal axis shows memory usage, and the vertical axis shows the R@1 metric of text-to-video retrieval on the MSR-VTT dataset. Marker sizes are proportional to the number of tunable parameters. Memory and parameters are calculated for the video learners only, excluding the backbone.
  • Figure 2: Illustration of MUSE. Our proposed method consists of three modules applied after video backbones. The generation module generates multi-scale video features based on single-scale visual output. Then, for the aggregation module, we test three different aggregation manners to aggregate multi-scale features into a 1D sequence. Finally, we design a residual architecture with Mamba to capture crucial video information from different granularities.
  • Figure 3: Comparison of memory usage among Transformer, Mamba, and the baseline. The baseline is CLIP4Clip (Luo et al., 2022) with mean pooling for feature aggregation.
  • Figure 4: Visualization of text-video retrieval examples. We sort the results by similarity score and visualize the top-ranked result. Green: correct with MUSE; Red: incorrect without MUSE. Crucial visual hints are marked with orange boxes.
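
The residual sequence learner described in Figure 2, together with the bidirectional scanning mentioned in the TL;DR, can be illustrated with a toy linear-time scan. The sketch below is not Mamba: it swaps in a fixed scalar recurrence with a hypothetical `decay` parameter, omits the gating and the input-dependent state-space parameters, and uses scalar tokens. It only shows why a scan costs one pass per direction and how forward and backward passes combine with a residual connection.

```python
from typing import List


def linear_scan(xs: List[float], decay: float = 0.5) -> List[float]:
    """Toy recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t, computed in one O(L) pass."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def bidirectional_residual_scan(xs: List[float], decay: float = 0.5) -> List[float]:
    """Add a forward scan and a backward scan to the input (residual connection)."""
    fwd = linear_scan(xs, decay)
    # Scan the reversed sequence, then flip back so positions line up with the input.
    bwd = linear_scan(xs[::-1], decay)[::-1]
    return [x + f + b for x, f, b in zip(xs, fwd, bwd)]


tokens = [1.0, 0.0, 0.0]  # stand-in for a flattened multi-scale token sequence
mixed = bidirectional_residual_scan(tokens)
print(mixed)
```

Each token is visited once per direction, so the cost grows linearly with sequence length, whereas self-attention compares every pair of tokens and grows quadratically.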