MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

Haoran Tang; Meng Cao; Jinfa Huang; Ruyang Liu; Peng Jin; Ge Li; Xiaodan Liang

TL;DR

MUSE introduces an efficient, plug-and-play multi-scale learning framework for text-video retrieval (TVR). It generates a feature pyramid from the last CLIP feature map and employs a linear-time Mamba-based learner (ResMamba) to model cross-resolution correlations. The approach studies three components (multi-scale feature generation, scale-wise aggregation, and the gated residual Mamba block) and achieves state-of-the-art results on MSR-VTT, DiDeMo, and ActivityNet with favorable memory and compute characteristics. Through extensive ablations, the authors show that scale-wise aggregation, bidirectional scanning, and the Mamba family offer superior efficiency and accuracy compared to Transformer-based or other linear-attention baselines. The work demonstrates the practical potential of linear-time cross-resolution context modeling for TVR and provides insights into scalable multi-scale video understanding.
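The feature-pyramid and scale-wise aggregation steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes average pooling as the downsampling operator, a hypothetical 8x8 patch grid, and pooling factors of 1, 2, and 4, none of which are specified in this summary. The function names are likewise made up for the example.

```python
import numpy as np

def avg_pool_2d(feat, k):
    """Non-overlapping k x k average pooling over the spatial axes.

    feat has shape (T, H, W, C): frames, patch rows, patch columns, channels.
    Assumption: pooling stands in for whatever downsampling the paper uses.
    """
    T, H, W, C = feat.shape
    assert H % k == 0 and W % k == 0, "spatial size must be divisible by k"
    return feat.reshape(T, H // k, k, W // k, k, C).mean(axis=(2, 4))

def build_pyramid(feat, factors=(1, 2, 4)):
    """Derive several resolutions from the single last-layer feature map."""
    return [avg_pool_2d(feat, k) for k in factors]

def scale_wise_sequence(pyramid):
    """Flatten each scale to (T * h * w, C), then concatenate scale by scale."""
    return np.concatenate([p.reshape(-1, p.shape[-1]) for p in pyramid], axis=0)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((12, 8, 8, 16))  # hypothetical: 12 frames, 8x8 patches, 16 dims
pyramid = build_pyramid(tokens)
sequence = scale_wise_sequence(pyramid)
print([p.shape for p in pyramid])
print(sequence.shape)
```

With these assumed sizes, the three scales contribute 768, 192, and 48 tokens, so the 1D sequence passed to the learner has 1008 tokens. Sequences of this length are where a linear-time learner becomes attractive compared with quadratic self-attention.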

Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.
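To make the "linear computational complexity" claim concrete, the sketch below runs a toy bidirectional recurrence over a token sequence and wraps it in a gated residual connection. It is deliberately not Mamba: real Mamba blocks use input-dependent (selective) state-space parameters and a hardware-aware scan, whereas this example assumes a fixed per-channel decay and a hypothetical sigmoid gate. It only illustrates why the cost grows linearly with sequence length and what scanning in both directions means.

```python
import numpy as np

def linear_scan(x, decay):
    """One pass over the sequence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    x has shape (L, C); decay has shape (C,). Cost is O(L * C), i.e. linear in L.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def bidirectional_gated_residual(x, decay, w_gate):
    """Toy stand-in for a gated residual block with bidirectional scanning."""
    fwd = linear_scan(x, decay)              # left-to-right context
    bwd = linear_scan(x[::-1], decay)[::-1]  # right-to-left context
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # hypothetical sigmoid gate
    return x + gate * (fwd + bwd)            # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((1008, 16))          # hypothetical multi-scale token sequence
decay = np.full(16, 0.9)                     # assumed fixed decay per channel
w_gate = 0.1 * rng.standard_normal((16, 16)) # assumed gate projection
y = bidirectional_gated_residual(x, decay, w_gate)
print(y.shape)
```

Each token is visited once per direction, so doubling the sequence length doubles the work. Self-attention instead compares every pair of tokens.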

Paper Structure (20 sections, 7 equations, 4 figures, 11 tables)

This paper contains 20 sections, 7 equations, 4 figures, 11 tables.

Introduction
Related Works
Methodology
Overview
Multi-scale Feature Aggregation
Mamba As Video Learner
Experiment
Experimental Settings
Performance Comparison
Ablative Analysis of Correlation Modeling
Ablative Analysis of Scan Strategies
Ablative Analysis of Scale Combination
Visualization Results
Conclusion
Acknowledgments
...and 5 more sections

Figures (4)

Figure 1: (a) Illustration of multi-scale features. Given the text query, the model without multi-scale features retrieves a relevant but incorrect video because the small but crucial object "torches" cannot be identified using only frame-level feature representations (e.g., [CLS] tokens). We visualize the token similarity between the word "torches" and our extracted multi-scale features by organizing the attention maps in a feature-pyramid style, from low to high resolution. Our model aggregates patches of the object "torches" (marked with a green boundary) from multiple granularities to build a correlation between the word "torches" and its visual entity in the video; (b) Efficiency-performance comparisons. The horizontal axis shows memory usage, and the vertical axis shows the R@1 metric of text-to-video retrieval on the MSR-VTT dataset. Marker sizes are proportional to the number of tunable parameters. Memory and parameters are calculated only on the video learners, excluding the backbone.
Figure 2: Illustration of MUSE. Our proposed method consists of three modules applied after the video backbone. The generation module produces multi-scale video features from the single-scale visual output. For the aggregation module, we test three different strategies for aggregating multi-scale features into a 1D sequence. Finally, we design a residual architecture with Mamba to capture crucial video information from different granularities.
Figure 3: Comparison of the memory usage among Transformer, Mamba, and Baseline. The selected baseline is CLIP4Clip (Luo et al., 2022) with mean pooling for feature aggregation.
Figure 4: Visualization of text-video retrieval examples. We sort results by their similarity scores and visualize the top-ranked result. Green: correct with MUSE; Red: incorrect without MUSE. Crucial visual hints are marked with orange boxes.
