Audio Conditioning for Music Generation via Discrete Bottleneck Features

Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez

TL;DR

This work enables audio-based conditioning for language-model-driven music generation via two routes: textual inversion, which maps audio into the embedding space of a pretrained text-to-music model, and a jointly trained style conditioner that ingests short audio clips through a bottlenecked RVQ-based encoder. It introduces a dual conditioning framework with a novel double classifier-free guidance that balances text and audio influences, and it proposes objective style metrics plus human evaluations to quantify style transfer without copying. Experiments on licensed and internal datasets show that a style encoder using EnCodec with two quantization levels offers a practical balance between fidelity and stylistic conformity, outperforming baselines that rely solely on textual or CLAP-based conditioning. The approach provides a flexible, controllable method for artist-like stylistic generation and suggests broader applicability to multimodal conditioning in generative models, while addressing ethical concerns about copying and data use.

Abstract

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model, we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code, and we provide music samples at https://musicgenstyle.github.io to show the quality of our model.
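
The balancing of text and audio conditioning at inference time can be illustrated with a small sketch. The guidance formula is not stated in this section, so the nested form below is an assumption made for illustration: the function name double_cfg, the coefficients alpha and beta, and their default values are all hypothetical. The sketch assumes the model produces three sets of logits for the same step (unconditional, style-only, and text plus style) and combines them so that beta controls how far the prediction moves from style-only toward text plus style, while alpha controls the overall guidance strength.

```python
from typing import List, Sequence


def double_cfg(
    logits_uncond: Sequence[float],
    logits_style: Sequence[float],
    logits_text_style: Sequence[float],
    alpha: float = 3.0,  # hypothetical default: overall guidance strength
    beta: float = 5.0,   # hypothetical default: extra emphasis on the text condition
) -> List[float]:
    """Nested classifier-free guidance over three logit vectors.

    Assumed form (not taken from this section):
        l = l_uncond + alpha * (l_style + beta * (l_text_style - l_style) - l_uncond)
    """
    if not (len(logits_uncond) == len(logits_style) == len(logits_text_style)):
        raise ValueError("all logit vectors must have the same length")

    mixed = []
    for l0, ls, lts in zip(logits_uncond, logits_style, logits_text_style):
        # Inner step: move the style-only prediction toward the text+style one.
        inner = ls + beta * (lts - ls)
        # Outer step: move the unconditional prediction toward the inner result.
        mixed.append(l0 + alpha * (inner - l0))
    return mixed
```

Under this assumed form, setting beta to 1 reduces the expression to ordinary single classifier-free guidance between the unconditional and the fully conditioned logits, and larger values of beta give the text description more weight relative to the audio style.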

Paper Structure (20 sections, 6 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: An overview of the Textual Inversion method based on the pretrained text-to-music model MusicGen.
  • Figure 2: An overview of the general architecture. Text conditioning and style conditioning are provided to the model as a prefix. On the right we present the style conditioner.
  • Figure :