HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit

Khushiyant, Param Thakkar

Abstract

Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX outperforms pure Mamba by 11.5 points.
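To make the quadratic limit concrete, the following back-of-the-envelope sketch estimates the memory needed to materialize the self-attention score matrices at the two sequence lengths mentioned here. The function name, the head count, the batch size, and the 4-byte (fp32) element size are illustrative assumptions, not values taken from the paper, and the sketch assumes a naive implementation that stores the full score matrix.

```python
def attention_score_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_elem=4):
    """Bytes needed to store the (seq_len x seq_len) attention scores of one layer.

    Assumes the full score matrix is materialized; num_heads, batch_size and
    bytes_per_elem are hypothetical defaults chosen only for illustration.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


# 5 minutes of audio mapped to 30,000 tokens implies 100 tokens per second.
tokens_per_second = 30_000 / (5 * 60)
print(f"token rate: {tokens_per_second:.0f} tokens/s")

for seq_len in (3_000, 30_000):
    gib = attention_score_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB per head per sample")

# A 10x longer sequence needs 100x the score-matrix memory.
ratio = attention_score_bytes(30_000) // attention_score_bytes(3_000)
print(f"growth factor from 3k to 30k tokens: {ratio}x")
```

Under these assumptions a single head on a single sample already needs about 3.6 GB for the scores at 30,000 tokens, before multiplying by heads, batch size, and the activations kept for the backward pass.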

Paper Structure (28 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Overview of design choice interactions. Left: The raw-vs-spectrogram accuracy gap (positive = raw better) depends on backbone. Pure Mamba prefers raw on ESC-50 but not Speech Commands; Pure Attention strongly prefers spectrograms. Center: All six Speech Commands variants. Right: As sequence length scales from 3k to 30k tokens, the HELIX advantage grows from 0 to 11.5 points; Pure Attention cannot run at 30k tokens.
  • Figure 2: ESC-50 results. Solid bars: raw waveform; hatched bars: spectrogram. Error bars show $\pm$1 std across 5 folds. Pure Mamba dominates; input preference flips between backbones.
  • Figure 3: Speech Commands v2 results. Epoch counts shown above bars (runs that hit compute limits did not reach 100 epochs). Pure Attention raw collapses; HELIX raw leads despite fewer epochs.
  • Figure 4: ESC-50 training curves (fold 1). Raw waveform models (left) learn more slowly, but Pure Mamba eventually surpasses all spectrogram variants (right). Smoothed with a 7-epoch moving average; raw values shown underneath.
  • Figure 5: Speech Commands training curves. Left: On raw waveforms, Pure Mamba raw destabilizes after epoch 30 and never recovers, while HELIX raw climbs steadily to 92.9%. Pure Attention raw plateaus at ${\sim}82$%. Right: Spectrograms stabilize all architectures; the gap between variants shrinks to ${\sim}1$%.
  • ...and 2 more figures
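
The Figure 4 caption mentions smoothing with a 7-epoch moving average. A minimal sketch of one way to compute such a curve is shown below. It assumes a trailing window that is shortened over the first few epochs; the caption does not say whether the window is trailing or centered, and the function name and sample values are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; early points average over the epochs seen so far.

    The trailing window is an assumption, since the caption only gives its length.
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical per-epoch accuracies, for illustration only.
raw_curve = [10.0, 30.0, 20.0, 40.0, 35.0, 50.0, 45.0, 60.0]
print(moving_average(raw_curve, window=7))
```

Plotting the smoothed curve on top of a faint copy of the raw values gives the kind of display the caption describes.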