VampNet: Music Generation via Masked Acoustic Token Modeling

Hugo Flores Garcia; Prem Seetharaman; Rithesh Kumar; Bryan Pardo

TL;DR

VampNet targets fast, flexible music generation without autoregressive decoding, using masked acoustic token modeling with parallel iterative decoding. It pairs a DAC-based audio tokenizer with two bidirectional transformer stages (coarse and coarse-to-fine) that predict masked tokens, so the same model supports both compression and creative variation through token-based prompts. The key contributions are the Masked Acoustic Token Modeling framework, a variable masking schedule for training, a confidence-based sampling loop, and a suite of prompting strategies, including beat-driven and periodic prompts, that interpolate between faithful reconstruction and free generation. Empirically, VampNet produces coherent, high-fidelity audio in as few as 36 sampling passes, and beat-driven prompts yield the best FAD. Generation is fast relative to autoregressive baselines, which suggests practical use in interactive music co-creation.

Abstract

We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.
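
The sampling idea described above (repeated parallel passes, each committing the most confident predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_predict` is a hypothetical stand-in for the bidirectional transformer, the cosine unmasking schedule is an assumption borrowed from common masked-generation practice, and the single flat token sequence ignores the multiple codebooks of a real audio tokenizer.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked token


def toy_predict(tokens, rng):
    """Stand-in for the model (hypothetical): one (token, confidence) guess per position."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def iterative_decode(tokens, num_steps=36, seed=0, predict=toy_predict):
    """Fill masked positions over at most `num_steps` parallel passes."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total_masked = sum(t == MASK for t in tokens)
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many tokens should remain masked after this pass.
        n_keep = int(total_masked * math.cos(math.pi / 2 * step / num_steps))
        preds = predict(tokens, rng)
        # Rank the masked positions by confidence, most confident first.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        # Commit at least one token per pass so the loop always makes progress.
        n_commit = max(1, len(masked) - n_keep)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


# Usage: keep every 4th token as a prompt and generate the rest.
prompt = [7 if i % 4 == 0 else MASK for i in range(32)]
filled = iterative_decode(prompt, num_steps=36)
```

Unmasked prompt tokens are never overwritten, which is what lets the same loop serve compression, inpainting, and variation depending on which tokens are masked.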

Paper Structure (16 sections, 2 equations, 6 figures)

Introduction
Background
Stage 1: Tokenization
Stage 2: Generation
Method
Masked Acoustic Token Modeling
Training procedure
Sampling
Prompting
Experiments
Dataset
Network Architecture and Hyperparameters
Efficiency of VampNet
Effect of prompts
Results and discussion
...and 1 more sections

Figures (6)

Figure 1: VampNet overview. We first convert audio into a sequence of discrete tokens using an audio tokenizer. Tokens are masked, and then passed to a masked generative model, which predicts values for masked tokens via an efficient iterative parallel decoding sampling procedure at two levels. We then decode the result back to audio.
Figure 2: Training, sampling, and prompting VampNet. Training: we train VampNet using Masked Acoustic Token Modeling, where we randomly mask a portion of a set of input acoustic tokens and learn to predict the masked set of tokens, using a variable masking schedule. Coarse model training masks coarse tokens. Coarse-to-fine training only masks fine tokens. Sampling: we sample new sequences of acoustic tokens from VampNet using parallel iterative decoding, where we sample a subset of the most confident predicted tokens each iteration. Prompting: VampNet can be prompted in a number of ways to generate music. For example, it can be prompted periodically, where every $P$th timestep in an input sequence is unmasked, or in a beat-driven fashion, where the timesteps around beat markings in a song are unmasked.
Figure 3: Mel reconstruction error (top) and Fréchet Audio Distance (FAD, bottom) for VampNet samples generated with varying numbers of sampling steps, using a periodic prompt of $P=16$. The samples were generated by de-compressing tokens at an extremely low bitrate (50 bps), effectively generating variations of the input signals.
Figure 4: Multiscale Mel-spectrogram error (top) and Fréchet Audio Distance (FAD, bottom) for 10s VampNet samples generated with different types of prompts.
Figure 5: Mel-spectrogram error (top) and Fréchet Audio Distance (FAD) (bottom) for VampNet samples at varying bitrates. A baseline is provided by replacing tokens in the input sequence with random tokens, per noise ratio $r$.
...and 1 more figures
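
The periodic and beat-driven prompts mentioned in the Figure 2 caption amount to choosing which timesteps stay unmasked. The sketch below shows one way to build such masks. The function names, the boolean convention (`True` means masked), and the symmetric `width` window around each beat are assumptions for illustration, not details taken from the paper.

```python
def periodic_mask(num_steps, period):
    """Keep every `period`-th timestep as a prompt; mask (True) everything else."""
    return [t % period != 0 for t in range(num_steps)]


def beat_driven_mask(num_steps, beat_steps, width=1):
    """Keep timesteps within `width` of each beat (assumed symmetric window); mask the rest."""
    mask = [True] * num_steps
    for beat in beat_steps:
        start = max(0, beat - width)
        stop = min(num_steps, beat + width + 1)
        for t in range(start, stop):
            mask[t] = False  # unmasked: this timestep is part of the prompt
    return mask


# Usage: a periodic prompt with P=4 over 8 timesteps, and beats at timesteps 2 and 7.
periodic = periodic_mask(8, 4)
beats = beat_driven_mask(10, [2, 7], width=1)
```

A larger `period` or a narrower `width` leaves fewer prompt tokens, pushing the result away from reconstruction and toward free variation.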
