SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Marco Comunità; Zhi Zhong; Akira Takahashi; Shiqi Yang; Mengjie Zhao; Koichi Saito; Yukara Ikemiya; Takashi Shibuya; Shusuke Takahashi; Yuki Mitsufuji

TL;DR

SpecMaskGIT addresses the inefficiency of state-of-the-art text-to-audio synthesis by introducing a masked, spectrogram-based generative model. It tokenizes Mel-spectrograms with a SpecVQGAN, trains a bidirectional Transformer to reconstruct randomly masked discrete tokens conditioned on CLAP embeddings, and uses classifier-free guidance with a cosine mask schedule for fast, high-quality synthesis. The approach achieves realistic 10-second audio with fewer than 16 iterations and real-time CPU performance, while enabling zero-shot tasks such as bandwidth extension and time inpainting. By framing text-conditioned audio synthesis as a generative extension of masked spectrogram modeling, it highlights the potential of masked audio modeling for a range of downstream applications and tasks.
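The two inference ingredients named above, a cosine mask schedule and classifier-free guidance, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes the schedule takes the common MaskGIT-style form cos(pi/2 * t/T) and that guidance linearly extrapolates from unconditional to conditional logits. The function names, the 256-token sequence length, and the guidance scale are hypothetical.

```python
import math

def cosine_mask_counts(num_tokens, num_iters):
    """Tokens left masked after each iteration, following cos(pi/2 * t/T)."""
    counts = []
    for t in range(1, num_iters + 1):
        ratio = math.cos(math.pi / 2 * t / num_iters)
        counts.append(math.floor(num_tokens * ratio))
    counts[-1] = 0  # the final iteration unmasks everything
    return counts

def guided_logits(cond, uncond, scale):
    """Classifier-free guidance: start at the unconditional logits and
    move `scale` times the (conditional - unconditional) difference."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

if __name__ == "__main__":
    # Illustrative values only: 256 tokens and 16 iterations are assumptions.
    print(cosine_mask_counts(num_tokens=256, num_iters=16))
    print(guided_logits([2.0, 0.5], [1.0, 1.0], scale=3.0))
```

With this form of the schedule, few tokens are committed in the early iterations and many in the late ones, which is one reason a small iteration budget can suffice.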

Abstract

Recent advances in generative models that iteratively synthesize audio clips have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient because they require hundreds of iterations at inference time and a large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10 s audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models on the TTA benchmark, while running in real time with only 4 CPU cores or even 30x faster with a GPU. Next, built upon a latent space of Mel-spectrograms, SpecMaskGIT has a wider range of applications (e.g., zero-shot bandwidth extension) than similar methods built on the latent waveform domain. Moreover, we interpret SpecMaskGIT as a generative extension of previous discriminative audio masked Transformers, and shed light on its potential for audio representation learning. We hope our work inspires the exploration of masked audio modeling in more diverse scenarios.

Paper Structure (12 sections, 3 equations, 8 figures, 8 tables)

This paper contains 12 sections, 3 equations, 8 figures, 8 tables.

Introduction
Related Works
SpecMaskGIT
Spectrogram Tokenizer and Vocoder
Masked Generative Modeling of Spectrograms
Text Conditioning via Sequential Modeling
Iterative Synthesis with Classifier-free Guidance
Experiments
Results
Text-to-audio Synthesis
Downstream Inpainting, BWE and Tagging Tasks
Conclusion

Figures (8)

Figure 1: Audio synthesis performance and number of synthesis iterations of different methods. The size of each circle represents the model size. SpecMaskGIT achieves decent quality with only 16 iterations and a small model size.
Figure 2: Real-time factor of SpecMaskGIT on different Xeon CPU cores with standard Python implementation.
Figure 3: SpecVQGAN, which encodes non-overlapping 16-by-16 time-mel patches into discrete tokens, and decodes the discrete tokens back to Mel-spectrogram.
Figure 4: Self-supervised training of SpecMaskGIT. The Transformer is trained to reconstruct SpecVQGAN token sequences that are randomly masked with variable masking ratios, conditioned on a semantic embedding from the CLAP encoder. "M" denotes the learned mask token, while "C" denotes the proposed conditional mask.
Figure 5: The iterative text-to-audio synthesis in SpecMaskGIT.
...and 3 more figures
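
The training and synthesis procedures captioned in Figures 4 and 5 can be illustrated with a toy sketch. It is not the paper's code and rests on several assumptions: the masking ratio is drawn uniformly at random, the decoding loop follows a MaskGIT-style confidence ranking with a cosine schedule, and `toy_predict` is a hypothetical stand-in for the Transformer that returns random tokens and confidences. CLAP conditioning and classifier-free guidance are omitted, and `MASK` and all sizes are illustrative.

```python
import math
import random

MASK = -1  # hypothetical id for the learned mask token

def random_mask(tokens, rng):
    """Training-time masking: hide a randomly sized subset of the tokens."""
    ratio = rng.random()  # assumption: uniform masking ratio in [0, 1)
    n = max(1, round(ratio * len(tokens)))
    targets = set(rng.sample(range(len(tokens)), n))
    masked = [MASK if i in targets else tok for i, tok in enumerate(tokens)]
    return masked, targets  # the model is trained to recover the targets

def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the Transformer: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def iterative_decode(num_tokens, num_iters, vocab_size, seed=0):
    """Start fully masked and commit the most confident predictions each step."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for t in range(1, num_iters + 1):
        preds = toy_predict(tokens, vocab_size, rng)
        # Tokens that should still be masked after this step (cosine schedule).
        if t < num_iters:
            keep_masked = math.floor(
                num_tokens * math.cos(math.pi / 2 * t / num_iters))
        else:
            keep_masked = 0
        masked_pos = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Rank the masked positions by confidence, most confident first.
        masked_pos.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = len(masked_pos) - keep_masked
        for i in masked_pos[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens

if __name__ == "__main__":
    print(random_mask(list(range(10)), random.Random(0)))
    print(iterative_decode(num_tokens=64, num_iters=8, vocab_size=1024))
```

In a real system the decoded token grid would be passed to the SpecVQGAN decoder to obtain a Mel-spectrogram and then to a vocoder to obtain a waveform.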
