SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering

Zhe Yang, Wenrui Li, Guanghui Cheng

TL;DR

This work addresses audio-visual question answering (AVQA) by introducing SHMamba, which combines hyperbolic geometry, used to capture hierarchical audio-visual relationships, with a structured state space model, used to model global temporal dynamics. It comprises an adaptive curvature hyperbolic alignment module (HAM), a cross fusion block (CFB) for cross-modal interaction, and a Mamba-based temporal backbone that enables efficient long-sequence processing. Empirical results on MUSIC-AVQA and AVQA show that SHMamba achieves state-of-the-art or strong performance with substantially fewer parameters and FLOPs, supported by ablations, visualizations, and qualitative QA examples. The approach demonstrates the practical potential of combining hyperbolic embeddings with structured dynamical models for robust, scalable multimodal reasoning.

Abstract

The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared with traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion more challenging. Euclidean space struggles to represent multi-dimensional relationships in data, and it is particularly ill-suited as an embedding space for data with a tree or hierarchical structure. Additionally, the self-attention mechanism in Transformers is effective at capturing the dynamic relationships between elements in a sequence, but its limitations in window modeling and its quadratic computational complexity reduce its effectiveness on long sequences. To address these limitations, we propose SHMamba, a Structured Hyperbolic State Space Model that integrates the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data. Meanwhile, the state space model captures dynamic changes over time by globally modeling the entire sequence. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and lower computational cost: learnable parameters are reduced by 78.12%, while average performance improves by 2.53%. Experiments show that our method outperforms current major methods and is better suited to practical application scenarios.
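
To make the hyperbolic embedding idea concrete, here is a minimal sketch of mapping a Euclidean feature vector into a Poincaré ball and measuring distance there. It assumes the standard Poincaré ball convention with curvature $-c$; the function names `expmap0` and `poincare_dist` and the fixed `c` argument are illustrative choices, not taken from the paper, whose alignment module learns the curvature adaptively.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin into the Poincaré ball.

    Assumption: standard exponential map at the origin for curvature -c.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

def poincare_dist(x, y, c=1.0):
    """Geodesic distance between two points inside the ball of curvature -c."""
    sq_norm = lambda u: sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Two points at Euclidean radius 0.5 on opposite sides of the origin.
d = poincare_dist([0.5, 0.0], [-0.5, 0.0])

# A tangent vector of norm 0.5 lands at hyperbolic distance 2 * 0.5 from the origin.
p = expmap0([0.3, 0.4])
r = poincare_dist([0.0, 0.0], p)
```

Distances grow rapidly toward the boundary of the ball, which is why tree-like data, whose node count grows exponentially with depth, can be embedded with low distortion. In a trainable model `c` would be a learnable parameter rather than a constant.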

Paper Structure (29 sections, 10 equations, 8 figures, 4 tables, 2 algorithms)

This paper contains 29 sections, 10 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: In (a), the Euclidean distance and the graph distance between $x_1$ and $x_2$ are $3$ and $2\ln 3$, respectively. (b) is a visualization of a 2-D Poincaré ball.
  • Figure 2: Transformer-based and Mamba-based networks differ in how they process long sequences. The Mamba-based method can model the entire sequence and handles long sequence data more efficiently.
  • Figure 3: Architecture of SHMamba. First, we employ pre-trained models to extract features from the audio, the video, and the question. Next, we use three linear layers as encoders for the audio, visual, and question features, respectively. We align the audio and visual features by mapping them into a hyperbolic space. At the same time, we capture the dynamics of spatio-temporal changes within the video via the Mamba module. Finally, the audio and visual features interact further and are fused through the cross fusion block to predict the answer to the input question.
  • Figure 4: Ablation study of the curvature value.
  • Figure 5: Ablation study of Mamba modules.
  • ...and 3 more figures
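
The efficiency contrast described in the Figure 2 caption can be illustrated with a minimal sketch of a discrete linear state space recurrence. It assumes a scalar state and fixed coefficients `a`, `b`, `c` (hypothetical names chosen here); Mamba itself uses input-dependent parameters and a hardware-aware scan, so this shows only the underlying recurrence, not the paper's module.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence.

    Assumption: scalar state and time-invariant coefficients, for illustration.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # the state summarizes the whole prefix seen so far
        ys.append(c * h)    # the output is read from the state only
    return ys

# An impulse at the first step decays geometrically through the state.
ys = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Each step touches only the previous state, so the cost grows linearly with sequence length, whereas self-attention compares every pair of positions and grows quadratically.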