Efficient Parallel Audio Generation using Group Masked Language Modeling

Myeonghun Jeong; Minchan Kim; Joun Yeop Lee; Nam Soo Kim

TL;DR

This paper proposes Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Experimental results show that the proposed model outperforms the baselines in prompt-based audio generation.

Abstract

We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and to improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.
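
To make the cost of "iterative sampling" concrete, the following is a minimal sketch of generic confidence-based iterative parallel decoding over masked tokens. It is not the paper's G-IPD: the abstract does not specify the group-wise logic, so none is modeled here. The cosine unmasking schedule, the `MASK` id, the vocabulary size of 1024, and the `toy_predict` stand-in for a masked language model are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def cosine_mask_count(step, total_steps, seq_len):
    """Positions that stay masked after `step` (1-indexed) of `total_steps`."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return int(math.floor(ratio * seq_len))


def toy_predict(tokens, rng):
    """Stand-in for a masked LM: one (token, confidence) pair per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def iterative_parallel_decode(seq_len, total_steps, predict=toy_predict, seed=0):
    """Fill a fully masked sequence in `total_steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    masked_history = []
    for step in range(1, total_steps + 1):
        preds = predict(tokens, rng)  # one model call per iteration
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = cosine_mask_count(step, total_steps, seq_len)
        n_commit = max(0, len(masked) - keep_masked)
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
        masked_history.append(keep_masked)
    return tokens, masked_history


tokens, history = iterative_parallel_decode(seq_len=16, total_steps=4)
```

Each iteration costs one forward pass of the model, so under this kind of scheme the number of iterations directly bounds inference time; that is the quantity the abstract says G-MLM and G-IPD aim to keep small.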

Paper Structure (17 sections, 2 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Backgrounds
Group Residual Vector Quantization (G-RVQ)
SoundStorm
Proposed method
Tokenization
Model Architecture
Training and Inference
Experiments
Experimental setup
Implementation details
Baselines
Evaluation metrics
Results and Analysis
Prompt-based audio generation
...and 2 more sections

Figures (5)

Figure 1: Overview of our proposed model
Figure 2: Bi-group, bi-depth G-RVQ for acoustic tokenization
Figure 3: Overall model architecture
Figure 4: Comparison of the iterative inference processes: (a) SoundStorm's IPD and (b) the proposed G-IPD. $s$ denotes the iteration step.
Figure 5: Comparison of inference speed. Prompt semantic tokenization is used only in SoundStorm's sampling process, and the reported SoundStorm runtime is measured without prompt semantic tokenization.
