
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

Yifei Xing, Xiangyuan Lan, Ruiping Wang, Dongmei Jiang, Wenjun Huang, Qingfang Zheng, Yaowei Wang

TL;DR

This work proposes Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables a Mamba-based MLLM to extract fine-grained visual information. The resulting model shows lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks.

Abstract

Mamba-based architectures have been shown to be a promising new direction for deep learning models owing to their competitive performance and sub-quadratic deployment speed. However, current Mamba multi-modal large language models (MLLMs) are insufficient in extracting visual features, leading to imbalanced cross-modal alignment between visual and textual latents and negatively impacting performance on multi-modal tasks. In this work, we propose Empowering Multi-modal Mamba with Structural and Hierarchical Alignment (EMMA), which enables the MLLM to extract fine-grained visual information. Specifically, we propose a pixel-wise alignment module to autoregressively optimize the learning and processing of spatial image-level features along with textual tokens, enabling structural alignment at the image level. In addition, to prevent the degradation of visual information during the cross-modal alignment process, we propose a multi-scale feature fusion (MFF) module to combine multi-scale visual features from intermediate layers, enabling hierarchical alignment at the feature level. Extensive experiments are conducted across a variety of multi-modal benchmarks. Our model shows lower latency than other Mamba-based MLLMs and is nearly four times faster than transformer-based MLLMs of similar scale during inference. Due to better cross-modal alignment, our model exhibits lower degrees of hallucination and enhanced sensitivity to visual details, which manifests in superior performance across diverse multi-modal benchmarks. Code will be provided.

Paper Structure (20 sections, 14 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Given an image of a pizza and the prompt 'Describe the image in detail', we visualize intermediate visual features and their corresponding textual responses in Mamba-based MLLMs. Each upper row represents the magnitude of reconstructed spatial activations on each image, and each bottom row highlights these patches on the original image. Each column (from left to right) represents intermediate layers with increasing depth. Cobra (zhao2024cobra) experiences a gradual loss of visual features as visual cues become blurred and unrecognizable, resulting in ineffective cross-modal alignment and producing an overly generalized (and hallucinated) answer. On the other hand, due to better cross-modal alignment, EMMA is capable of preserving visual details even in deeper layers of the LLM, highlighting areas such as the perimeter of the pizza tray, the overall pizza, and the spatula on the top right of the image. The resulting text demonstrates higher alignment to the image data in the form of sensitivity to visual details and spatial relationships.
  • Figure 2: Overview of EMMA. In addition to the textual responses, our model extracts a holistic visual feature from the Mamba LLM through a multi-scale feature fusion module that hierarchically combines intermediate visual features through cross-attention and Mamba projections. A pixel-wise alignment loss is calculated between the final visual feature and the original image to enable the learning of more fine-grained features and increased multi-modal alignment.
  • Figure 3: Comparison between Mamba and Mamba2 blocks (dao2024transformers).
  • Figure 4: As shown, our method better highlights important characteristics of the image, such as the captions and horizontal and vertical axes. Cobra, on the other hand, gradually loses its visual activations on these characteristics, resulting in an incorrect reference to the plot as "calledKW".
  • Figure 5: Similar to the previous example, EMMA is able to retain focus on important visual details even during later intermediate layers, resulting in better sensitivity to fine-grained visual details and less visual hallucination.
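
The two mechanisms named in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `fuse_multi_scale`, `pixel_alignment_loss`) are hypothetical, the cross-attention has no learned weights, the Mamba projections are omitted, the residual sum is an assumed way of combining features, and mean squared error is an assumed form of the pixel-wise loss, since the summary above does not state the exact formula.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head cross-attention with no learned projections (assumption).

    queries: N vectors of dimension D; keys_values: M vectors of dimension D.
    Each query returns a softmax-weighted average of the key/value vectors.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values))
             for j in range(dim)]
        )
    return attended


def fuse_multi_scale(layer_features):
    """Fold shallower-layer features into the deepest layer's features.

    layer_features: per-layer token lists, ordered shallow to deep.
    The deepest features act as queries; each shallower layer is attended to
    in turn and added residually (an assumed combination rule).
    """
    fused = layer_features[-1]
    for shallower in reversed(layer_features[:-1]):
        attended = cross_attend(fused, shallower)
        fused = [
            [f + a for f, a in zip(f_vec, a_vec)]
            for f_vec, a_vec in zip(fused, attended)
        ]
    return fused


def pixel_alignment_loss(reconstructed, image):
    """Mean squared error over all pixel values (assumed loss form)."""
    count = sum(len(row) for row in image)
    squared_error = sum(
        (p - t) ** 2
        for rec_row, img_row in zip(reconstructed, image)
        for p, t in zip(rec_row, img_row)
    )
    return squared_error / count


if __name__ == "__main__":
    # Two "layers" of two tokens each; tokens double as 2-pixel patches here.
    shallow = [[1.0, 0.0], [0.0, 1.0]]
    deep = [[0.5, 0.5], [0.2, 0.8]]
    fused = fuse_multi_scale([shallow, deep])
    print(pixel_alignment_loss(fused, shallow))
```

In this toy setting the fused tokens are compared directly with the image patches. A real model would first decode the fused feature back to pixel space, and the loss would be minimized jointly with the text objective.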