SAM: A Mamba-2 State-Space Audio-Language Model

Taehan Lee, Jaehan Jung, Hyukjun Lee

TL;DR

This work provides the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs, establishing practical design principles for SSMs as strong, scalable backbones for audio-language models.

Abstract

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overall architecture of our SSM-based Audio-language Model (SAM).
  • Figure 2: $\tau$-effective rank across training stages by model size.
  • Figure 3: State update distance between adjacent audio tokens.
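
The abstract's first finding rests on the rank and similarity of audio token representations, and Figure 2 tracks a $\tau$-effective rank across training stages. This page does not give the paper's definition, so the following is only a minimal sketch under an assumed, commonly used threshold-based definition: the smallest number of singular values whose cumulative share of the total reaches $\tau$. The function names, the default $\tau = 0.99$, and the adjacent-token cosine similarity measure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def tau_effective_rank(tokens: np.ndarray, tau: float = 0.99) -> int:
    """Assumed definition: smallest k such that the top-k singular values
    account for at least a fraction tau of the singular-value sum.

    tokens: array of shape (num_tokens, dim), one row per audio token.
    """
    s = np.linalg.svd(tokens, compute_uv=False)  # returned in descending order
    total = s.sum()
    if total == 0.0:
        return 0  # an all-zero matrix carries no directions at all
    cumulative = np.cumsum(s) / total
    k = int(np.searchsorted(cumulative, tau)) + 1
    return min(k, len(s))  # guard against rounding when tau is close to 1


def mean_adjacent_cosine(tokens: np.ndarray) -> float:
    """Average cosine similarity between each token and the next one
    (an assumed, simple proxy for token redundancy)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, 1e-12)  # avoid division by zero
    return float(np.mean(np.sum(normed[:-1] * normed[1:], axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical encoder output: 64 audio tokens of dimension 32.
    tokens = rng.standard_normal((64, 32))
    print("effective rank:", tau_effective_rank(tokens))
    print("adjacent similarity:", mean_adjacent_cosine(tokens))
```

Under this assumed definition, a low effective rank combined with high adjacent similarity would indicate redundant tokens, which is the kind of signal the second finding (compact, information-rich tokens over long sequences) is concerned with.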