MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang

TL;DR

This work proposes MLVTG, a framework that combines two modules, MambaAligner and LLMRefiner, and reports state-of-the-art performance on QVHighlights, Charades-STA, and TVSum.

Abstract

Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks, rather than Transformers, as its backbone to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages a specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy (temporal modeling via structured state-space dynamics and semantic purification via textual priors) enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.
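
To make the phrase "structured state-space dynamics" more concrete, the following is a minimal sketch of a discrete linear state-space recurrence over a 1-D sequence, with an optional two-direction pass. It is an illustration only, not the paper's MambaAligner: real Vision Mamba blocks add input-dependent (selective) parameters, gating, and learned projections. The function names `ssm_scan` and `bidirectional_scan` and all parameter values are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(x: Sequence[float], a: Sequence[float],
             b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Run a diagonal linear state-space recurrence over a sequence.

    For each time step t (elementwise over the state dimension):
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must share the same state dimension")
    h = [0.0] * len(a)  # hidden state starts at zero
    ys: List[float] = []
    for x_t in x:
        # Update every state channel, then read out a scalar output.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def bidirectional_scan(x: Sequence[float], a: Sequence[float],
                       b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Sum a forward scan and a time-reversed scan (illustrative assumption)."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(list(reversed(x)), a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically with rate a=0.5.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
```

Each step costs a fixed amount of work per state channel, so the scan is linear in sequence length, in contrast to the pairwise comparisons of full self-attention.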

Paper Structure

This paper contains 16 sections, 8 equations, 10 figures.

Figures (10)

  • Figure 1: MLVTG projects features of video-text pairs into a shared semantic space, narrows the semantic gap, and better aligns visual-text features. Its dual-branch architecture separately handles temporal localization and highlight detection.
  • Figure 2: Overview of the MLVTG framework. MLVTG first processes the input video and query using frozen feature encoders and projects them into a shared semantic space. It then computes saliency scores directly in one branch, while in the other, fused features are processed by the MambaAligner. The aligned features undergo semantic purification via the Mamba-based LLMRefiner. Finally, task-specific heads classify the refined features to produce results.
  • Figure 3: Performance on QVHighlights. The best results are in bold, and the second-best results are underlined.
  • Figure 4: Temporal localization performance on Charades-STA.
  • Figure 5: Highlight detection performance (Top-5 mAP) on TVSum. "†" denotes methods that use the audio modality. The second-best results are underlined.
  • ...and 5 more figures