SAM: A Mamba-2 State-Space Audio-Language Model

Taehan Lee; Jaehan Jung; Hyukjun Lee

TL;DR

This work provides the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs, establishing practical design principles for SSMs as strong, scalable backbones for audio-language models.

Abstract

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing 7B transformer-based models despite having fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, as supported by accuracy gains and by the observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite their linear scaling with sequence length, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
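
The abstract refers to the rank and similarity of audio token representations without defining the metrics. As a rough illustration only, the sketch below computes two common proxies for a matrix of token embeddings: an entropy-based effective rank and the mean pairwise cosine similarity. Both metric choices and the function names `effective_rank` and `mean_pairwise_cosine` are assumptions made for this example; they are not taken from the paper and may differ from what the authors measured.

```python
import numpy as np

def effective_rank(tokens: np.ndarray) -> float:
    """Entropy-based effective rank of a (num_tokens, dim) matrix.

    Assumed metric for illustration: exp of the Shannon entropy of the
    normalized singular values. Not necessarily the paper's definition.
    """
    s = np.linalg.svd(tokens, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def mean_pairwise_cosine(tokens: np.ndarray) -> float:
    """Mean cosine similarity over all distinct token pairs (assumed metric)."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T
    n = sim.shape[0]
    off_diag_sum = sim.sum() - np.trace(sim)  # exclude self-similarity
    return float(off_diag_sum / (n * (n - 1)))

# Hypothetical stand-in for audio encoder outputs: 64 tokens, 32 dimensions.
rng = np.random.default_rng(0)
example_tokens = rng.normal(size=(64, 32))
example_rank = effective_rank(example_tokens)
example_similarity = mean_pairwise_cosine(example_tokens)
```

Under these assumed definitions, a set of mutually orthogonal tokens has an effective rank equal to the number of tokens and a mean similarity of zero, while identical tokens collapse to an effective rank of about one and a mean similarity of one. Comparing such values before and after joint finetuning is one way the kind of adaptation described in the abstract could be quantified.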
