MARS: Audio Generation via Multi-Channel Autoregression on Spectrograms
Eleonora Ristori, Luca Bindini, Paolo Frasconi
TL;DR
MARS reframes spectrogram-based audio generation by treating spectrograms as multi-channel images and applying next-scale autoregression with a shared tokenizer across resolutions. The core innovations are channel multiplexing (CMX), which reduces spatial resolution without data loss, and a transformer-based autoregressor that refines spectrograms from coarse to fine scales. A cross-scale tokenizer trained with a composite loss ensures consistent discrete representations, enabling efficient hierarchical generation. On NSynth, MARS achieves competitive or superior performance across multiple reconstruction, diversity, and perceptual metrics, while maintaining favorable compute and memory characteristics, suggesting a scalable path to high-fidelity audio synthesis.
Abstract
Research on audio generation has progressively shifted from waveform-based approaches to spectrogram-based methods, which more naturally capture harmonic and temporal structures. At the same time, advances in image synthesis have shown that autoregression across scales, rather than across tokens, improves coherence and detail. Building on these ideas, we introduce MARS (Multi-channel AutoRegression on Spectrograms), a framework that treats spectrograms as multi-channel images and employs channel multiplexing (CMX), a reshaping technique that lowers height and width without discarding information. A shared tokenizer provides consistent discrete representations across scales, enabling a transformer-based autoregressor to refine spectrograms efficiently from coarse to fine resolutions. Experiments on a large-scale dataset demonstrate that MARS performs comparably to or better than state-of-the-art baselines across multiple evaluation metrics, establishing an efficient and scalable paradigm for high-fidelity audio generation.
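To make the idea of lowering height and width without discarding information concrete, the following is a minimal sketch of one lossless reshaping of this kind, a space-to-depth rearrangement that moves each r x r spatial block into extra channels. It is an illustration under stated assumptions, not the authors' implementation: the function names `cmx` and `cmx_inverse`, the factor `r`, and the channel ordering are all hypothetical, and the exact layout used by MARS may differ.

```python
def cmx(spec, r):
    """Fold each r x r spatial block into channels (space-to-depth).

    spec: list of C channels, each a list of H rows of W values.
    Returns C*r*r channels, each of size (H//r) x (W//r).
    Assumed layout for illustration only; not taken from the paper.
    """
    height, width = len(spec[0]), len(spec[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for channel in spec:
        for dy in range(r):
            for dx in range(r):
                # Sub-grid of the pixels at offset (dy, dx) in every block.
                out.append([row[dx::r] for row in channel[dy::r]])
    return out


def cmx_inverse(packed, r):
    """Undo cmx by scattering each sub-grid back to its original positions."""
    h, w = len(packed[0]), len(packed[0][0])
    n_channels = len(packed) // (r * r)
    spec = []
    for c in range(n_channels):
        channel = [[None] * (w * r) for _ in range(h * r)]
        for dy in range(r):
            for dx in range(r):
                sub = packed[(c * r + dy) * r + dx]
                for i in range(h):
                    for j in range(w):
                        channel[i * r + dy][j * r + dx] = sub[i][j]
        spec.append(channel)
    return spec


# Toy single-channel 4 x 4 "spectrogram" with distinct values 0..15.
spec = [[[4 * i + j for j in range(4)] for i in range(4)]]
packed = cmx(spec, 2)            # 4 channels, each 2 x 2
restored = cmx_inverse(packed, 2)
```

In this sketch the round trip `cmx_inverse(cmx(spec, r), r)` returns the input exactly, which is the sense in which the reshaping is lossless: the spatial grid shrinks by a factor of r in each dimension while the channel count grows by r squared, so the total number of values is unchanged.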
