Efficient Parallel Audio Generation using Group Masked Language Modeling

Myeonghun Jeong, Minchan Kim, Joun Yeop Lee, Nam Soo Kim

TL;DR

Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) are proposed for efficient parallel audio generation. Experimental results demonstrate that the proposed model outperforms the baselines in prompt-based audio generation.

Abstract

We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and to improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our proposed model
  • Figure 2: Bi-group, bi-depth G-RVQ for acoustic tokenization
  • Figure 3: Overall model architecture
  • Figure 4: Comparison of iterative inference process: (a) SoundStorm's IPD, and (b) proposed method's G-IPD technique. $s$ denotes the iteration steps.
  • Figure 5: Comparison of inference speed. Prompt semantic tokenization is used only in SoundStorm's sampling process, and the reported SoundStorm runtime is measured without it.