MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms

Eleonora Ristori, Luca Bindini, Paolo Frasconi

TL;DR

MARS reframes spectrogram-based audio generation by treating spectrograms as multi-channel images and applying next-scale autoregression with a shared tokenizer across resolutions. The core innovations are channel multiplexing (CMX), which reduces spatial resolution without data loss, and a transformer-based autoregressor that refines spectrograms from coarse to fine scales. A cross-scale tokenizer trained with a composite loss ensures consistent discrete representations, enabling efficient hierarchical generation. On NSynth, MARS achieves competitive or superior performance across multiple reconstruction, diversity, and perceptual metrics, while maintaining favorable compute and memory characteristics, suggesting a scalable path to high-fidelity audio synthesis.
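
The coarse-to-fine refinement described above can be illustrated with a small residual pyramid. This is a minimal sketch in the style of next-scale prediction, not the MARS implementation: the helper names (`downsample`, `upsample`, `coarse_to_fine`), the average-pool and nearest-neighbour resampling, and the scale schedule are all assumptions made for illustration. In a real model each residual map would be quantized into tokens by the shared tokenizer and predicted by the transformer given the coarser scales; here the residuals are kept continuous so the reconstruction is exact.

```python
import numpy as np

def downsample(x: np.ndarray, size: int) -> np.ndarray:
    """Average-pool a square (H, H) array to (size, size); H must be divisible by size."""
    f = x.shape[0] // size
    return x.reshape(size, f, size, f).mean(axis=(1, 3))

def upsample(x: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsample a square array to (size, size)."""
    f = size // x.shape[0]
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def coarse_to_fine(target: np.ndarray, scales):
    """Decompose `target` into per-scale residuals and rebuild it scale by scale."""
    h = target.shape[0]
    running = np.zeros_like(target)      # reconstruction accumulated so far
    residuals = []
    for s in scales:
        # What is still missing, summarised at resolution s x s.
        r = downsample(target - running, s)
        residuals.append(r)
        # Refine the running estimate with this scale's contribution.
        running = running + upsample(r, h)
    return residuals, running

# Hypothetical 8 x 8 "spectrogram" and scale schedule (assumptions).
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 8))
residuals, recon = coarse_to_fine(target, scales=(1, 2, 4, 8))
```

Because the last scale matches the full resolution, the accumulated estimate equals the target up to floating-point error; the earlier scales carry the coarse structure and the later ones only add detail.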

Abstract

Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping technique that lowers height and width without discarding information. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms from coarse to fine resolutions efficiently. Experiments on a large-scale dataset demonstrate that MARS performs comparably to or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity audio generation.
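
To make the idea of a reshaping that lowers height and width without discarding information concrete, the sketch below folds each r x r block of a spectrogram into the channel axis (a space-to-depth rearrangement). The exact CMX layout is not specified in this summary, so the function names `cmx` and `cmx_inverse`, the factor `r`, and the channel ordering are assumptions; the point is only that the operation is a pure permutation of values and is therefore invertible.

```python
import numpy as np

def cmx(spec: np.ndarray, r: int) -> np.ndarray:
    """Fold r x r spatial blocks into channels: (C, H, W) -> (C*r*r, H//r, W//r)."""
    c, h, w = spec.shape
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    x = spec.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)

def cmx_inverse(x: np.ndarray, r: int) -> np.ndarray:
    """Undo cmx: (C*r*r, H', W') -> (C, H'*r, W'*r)."""
    cr2, hh, ww = x.shape
    c = cr2 // (r * r)
    y = x.reshape(c, r, r, hh, ww)
    y = y.transpose(0, 3, 1, 4, 2)            # (C, H', r, W', r)
    return y.reshape(c, hh * r, ww * r)

# Hypothetical single-channel spectrogram: 128 frequency bins x 64 frames.
spec = np.arange(128 * 64, dtype=np.float32).reshape(1, 128, 64)
packed = cmx(spec, r=4)                       # shape (16, 32, 16)
restored = cmx_inverse(packed, r=4)           # identical to spec
```

The total number of values is unchanged; only the layout differs, which is why height and width can shrink by a factor of r each while the channel count grows by r squared.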

Paper Structure

This paper contains 11 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Audio preprocessing pipeline for tokenizer input preparation and channel multiplexing (CMX) for reducing input resolution.
  • Figure 2: Tokenizer architecture. The tokenizer is adapted from ImageFolder (DBLP:conf/iclr/0106Q0KGRL25), improving upon the original VAR tokenizer (tian2024visual). The input spectrogram is partitioned into patches of size $L \times L$ and concatenated with $S$ learnable tokens before being processed by a transformer encoder $\mathcal{E}$, producing latent representations $z$. These are discretized by a vector quantizer to obtain $z'$, which are then combined with another set of $L \times L$ learnable tokens and passed to a decoder $\mathcal{D}$ for reconstruction.
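
The data flow in the Figure 2 caption can be traced with array shapes. The following is a rough sketch under several assumptions, not the authors' tokenizer: the values of `L`, `S`, the embedding width `D`, and the codebook size `K` are hypothetical, the encoder $\mathcal{E}$ and decoder $\mathcal{D}$ are omitted, the caption's $L \times L$ is read here as a grid of patch embeddings, and the quantizer is a plain nearest-neighbour codebook lookup. Which encoder outputs are kept as $z$ is also an assumption; here the S learnable-token positions stand in for them.

```python
import numpy as np

def vector_quantize(z: np.ndarray, codebook: np.ndarray):
    """Replace each row of z (N, D) with its nearest codebook row (K, D)."""
    # Squared Euclidean distance between every latent and every code: (N, K).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)                    # discrete token ids, shape (N,)
    return codebook[idx], idx

rng = np.random.default_rng(0)
L, S, D, K = 4, 8, 16, 32                     # hypothetical sizes

patches = rng.normal(size=(L * L, D))         # L x L patch embeddings
learnable = rng.normal(size=(S, D))           # S learnable tokens
encoder_in = np.concatenate([patches, learnable], axis=0)   # (L*L + S, D)

# Stand-in for the encoder output: keep the S learnable-token positions as z.
z = encoder_in[-S:]
codebook = rng.normal(size=(K, D))
z_q, token_ids = vector_quantize(z, codebook) # z' and its integer indices

# Decoder input: z' combined with a fresh set of L x L learnable tokens.
decoder_queries = rng.normal(size=(L * L, D))
decoder_in = np.concatenate([decoder_queries, z_q], axis=0) # (L*L + S, D)
```

The integer `token_ids` are what an autoregressive model would predict; the decoder only ever sees the quantized vectors `z_q`.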