H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Siran Chen, Yuxiao Luo, Yue Ma, Yu Qiao, Yali Wang

TL;DR

This paper tackles multi-modal video understanding in autonomous driving, where open-domain questions and dynamic, multi-scale motions hinder existing MLLMs. It proposes Hierarchical Mamba Adaptation (H-MBA), which comprises Context Mamba (C-Mamba), for extracting multi-granularity spatio-temporal contexts at high and low temporal resolutions, and Query Mamba (Q-Mamba), for adaptively fusing these contexts into a learnable query fed to a frozen visual encoder and LLM. The method achieves state-of-the-art results on DRAMA risk localization (66.9% mIoU, +5.5% over the prior SOTA) and strong performance on BDD-X among MLLM-based approaches, while keeping compute cost favorable thanks to Mamba's efficiency. The approach improves risk localization and justification in driving scenarios, supporting more interpretable autonomous driving systems.

Abstract

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context at different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate the video contexts of all temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method by 5.5% mIoU.
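
The risk object detection result above is reported in mIoU. As a rough illustration of what that metric measures, the sketch below computes the mean intersection-over-union between predicted and ground-truth boxes. It assumes one axis-aligned box per sample in (x1, y1, x2, y2) form; the function names `box_iou` and `mean_iou` are illustrative, and the exact DRAMA evaluation protocol is not given in this summary, so this may differ from the authors' evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout (not specified in the summary): (x1, y1, x2, y2), x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground truths (hypothetical helper)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 1, 1)]
    ground_truth = [(0, 0, 2, 2), (2, 2, 3, 3)]
    # First pair overlaps perfectly, second pair does not overlap at all.
    print(mean_iou(predictions, ground_truth))
```

Under this definition, a 5.5% mIoU gain means the predicted risk boxes overlap the annotated ones by 5.5 percentage points more on average.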

Paper Structure (21 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Motivation. (a) Previous models fail to give the correct description and justification with only single-scale video input, while H-MBA combines multi-scale features and produces the most appropriate answer. (b) Risk localization performance on DRAMA (top/bottom: caption/detection). Mamba blocks offer a better trade-off between performance and computation than attention modules, and our H-MBA achieves the best balance.
  • Figure 2: Pipeline of the H-MBA framework. An extra H-Mamba Adaptation block processes the video input; "hierarchical" refers here to the high and low temporal resolutions and the different Mamba-style modules. After fusion by the Q-Mamba adapter, the multi-scale features are aligned with the text query prompt and sent to the LLM to produce the final answer.
  • Figure 3: Illustration of the three different space-time Mamba sequences in our C-Mamba, which yield multi-granularity video features suited to various tasks.
  • Figure 4: Visual comparison of the outputs of different temporal processing modules. In some rare scenarios, such as a lane change caused by a bus occupying the road ahead, our H-MBA recognizes the action and provides a reasonable explanation, while the other modules focus on the wrong points.
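
The captions for Figures 2 and 3 mention high and low temporal resolutions and several space-time token sequences. The sketch below is only a generic illustration of those two ideas: subsampling frames with a stride, and flattening a T x H x W grid of video tokens in different scan orders before a sequence model reads them. The captions do not say which three orderings C-Mamba uses or which frame rates are sampled, so the two orderings, the stride values, and all function names here are assumptions for illustration, not the paper's design.

```python
from typing import List, Sequence, Tuple, TypeVar

F = TypeVar("F")
Token = Tuple[int, int, int]  # (frame index t, patch row h, patch column w)


def sample_frames(frames: Sequence[F], stride: int) -> List[F]:
    """Keep every `stride`-th frame; stride=1 is the highest temporal resolution."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(frames[::stride])


def spatial_first(T: int, H: int, W: int) -> List[Token]:
    """Scan all patches of frame 0, then all patches of frame 1, and so on."""
    return [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]


def temporal_first(T: int, H: int, W: int) -> List[Token]:
    """Scan one patch location across all frames before moving to the next location."""
    return [(t, h, w) for h in range(H) for w in range(W) for t in range(T)]


if __name__ == "__main__":
    frames = list(range(8))          # stand-in for 8 decoded video frames
    high = sample_frames(frames, 1)  # assumed "high" temporal resolution
    low = sample_frames(frames, 4)   # assumed "low" temporal resolution
    print(len(high), len(low))
    print(spatial_first(2, 1, 2))
    print(temporal_first(2, 1, 2))
```

Both orderings visit the same tokens; only the neighborhood seen by a sequential model changes, which is one way different scans can emphasize spatial or temporal context.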