Combining audio control and style transfer using latent diffusion

Nils Demerlé; Philippe Esling; Guillaume Doras; David Genova

TL;DR

The paper tackles the challenge of enabling both explicit control and style transfer in audio generation by disentangling structure and timbre into separate latent representations that condition a latent diffusion model. It introduces an invertible audio codec and two semantic encoders (structure and timbre), trained with a two-stage adversarial disentanglement strategy so that content and style can be controlled independently. The method is evaluated on MIDI-to-audio rendering, one-shot timbre transfer, and complete music style transfer, and outperforms baselines in audio quality and target fidelity. This unified approach offers artists a practical way to combine precise structural control with flexible timbre and style transfer in realistic musical contexts.

Abstract

Deep generative models are now able to synthesize high-quality audio signals, shifting the critical aspect in their development from audio quality to control capabilities. Although text-to-music generation is getting largely adopted by the general public, explicit control and example-based style transfer are more adequate modalities to capture the intents of artists and musicians. In this paper, we aim to unify explicit control and style transfer within a single model by separating local and global information to capture musical structure and timbre respectively. To do so, we leverage the capabilities of diffusion autoencoders to extract semantic features, in order to build two representation spaces. We enforce disentanglement between those spaces using an adversarial criterion and a two-stage training strategy. Our resulting model can generate audio matching a timbre target, while specifying structure either with explicit controls or through another audio example. We evaluate our model on one-shot timbre transfer and MIDI-to-audio tasks on instrumental recordings and show that we outperform existing baselines in terms of audio quality and target fidelity. Furthermore, we show that our method can generate cover versions of complete musical pieces by transferring rhythmic and melodic content to the style of a target audio in a different genre.
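
To make the local/global split described in the abstract concrete, the toy sketch below shows only the shape of the conditioning: a per-frame (local) structure code and a single clip-level (global) timbre code, taken from the same signal during training and from different signals at inference. It is not the authors' model. The names `encode_structure`, `encode_timbre`, and `build_conditioning` are hypothetical, and simple mean statistics stand in for the learned encoders $E_S$ and $E_T$.

```python
from statistics import fmean

def encode_structure(frames):
    """Toy local encoder (stand-in for E_S): one value per frame.

    Assumption for illustration only: subtracting the clip-wide mean keeps
    frame-to-frame variation and discards the global level.
    """
    mean = fmean(frames)
    return [x - mean for x in frames]

def encode_timbre(frames):
    """Toy global encoder (stand-in for E_T): one value per clip, pooled over time."""
    return fmean(frames)

def build_conditioning(structure_source, timbre_source):
    """Pair each local structure frame with the same global timbre code.

    In a real system this conditioning would be fed to a diffusion denoiser;
    that part is omitted here.
    """
    structure = encode_structure(structure_source)
    timbre = encode_timbre(timbre_source)
    return [(s, timbre) for s in structure]

# Training-style call: both codes come from the same signal.
signal = [1.0, 2.0, 3.0]
train_cond = build_conditioning(signal, signal)

# Inference-style call: structure from one signal, timbre from another.
target = [10.0, 20.0]
transfer_cond = build_conditioning(signal, target)
```

In this layout the conditioning sequence always has the length of the structure source, while the timbre code is constant across frames, which mirrors the abstract's distinction between local and global information.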

Paper Structure (20 sections, 6 equations, 3 figures, 3 tables)

This paper contains 20 sections, 6 equations, 3 figures, 3 tables.

Introduction
Background
Diffusion models
Diffusion autoencoders
Control in audio generation
Unsupervised disentanglement in sequential data
Method
Audio codec
Model structure
Style and content disentanglement
Experiments
Dataset
Evaluation metrics
Baselines
Training details
...and 5 more sections

Figures (3)

Figure 1: General overview of our method. We extract timbre and structure representations from waveform and/or MIDI inputs using encoders $E_T$ and $E_S$ respectively. Those representations condition a latent diffusion model, enabling both explicit and example-based control.
Figure 2: Detailed overview of our method. Input signal(s) are passed to structure and timbre encoders, which provide semantic encodings that are further disentangled through confusion maximization. These are used to condition a latent diffusion model to generate the output signal. Input signals are identical during training but distinct at inference.
Figure :
