Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Javier Nistal; Marco Pasini; Cyran Aouameur; Maarten Grachten; Stefan Lattner

TL;DR

This work introduces Diff-A-Riff, a Latent Diffusion Model that generates high-quality instrumental accompaniments adaptable to any musical context. It produces 48 kHz pseudo-stereo audio while significantly reducing inference time and memory usage.

Abstract

Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce "Diff-A-Riff," a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. The model offers control through audio references, text prompts, or both, and produces 48 kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model's capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website: sonycslparis.github.io/diffariff-companion/

Paper Structure (19 sections, 3 figures, 3 tables)

Introduction
Related Work
Background
Methodology
Dataset
Diff-A-Riff
Consistency Autoencoder
Latent Diffusion Model
Training
Evaluation
Inference Configurations
Objective Metrics
Listening Tests
Results & Discussion
Objective Evaluation
...and 4 more sections

Figures (3)

Figure 1: Overview of Diff-A-Riff. Left: The CAE Encoder transforms the music context into a compressed representation, concatenated with a noisy sample, and further processed through a multi-scale U-Net. At each scale, conditional CLAP and time-step embeddings are integrated through a feature-wise linear transformation. The generated latent sequence is decoded via the CAE Decoder. We highlight frozen components in blue and trainable elements in orange. Text prompting is only used at inference. Right: The encoder architecture comprises four down-sampling blocks with four convolutional and group norm layers with skip connections. The decoder mirrors this architecture.
Figure 2: MMD$^2$ as a function of the number of denoising steps $T$ for various conditional settings (see Sec. \ref{sec:obj_eval_res}).
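
The Figure 1 caption states that CLAP and time-step embeddings are integrated at each U-Net scale through a feature-wise linear transformation. As a rough illustration of that general idea only, the sketch below scales and shifts each feature channel with values predicted from a conditioning vector. It is not the authors' implementation: the function names (`linear`, `film`), the shapes, the single combined conditioning vector, and the plain linear projections are all assumptions made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Matrix, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Feature-wise linear modulation (hypothetical helper).

    `features` is laid out as channels x time. A per-channel scale (gamma)
    and shift (beta) are predicted from the conditioning vector `cond`
    and applied to every time step of the matching channel.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy example: 2 channels, 3 time steps, 2-dimensional conditioning vector.
features = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
cond = [2.0, 0.5]
modulated = film(
    features, cond,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],  # gamma = cond
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[1.0, -1.0],   # beta = fixed bias
)
print(modulated)  # [[3.0, 5.0, 7.0], [1.0, 2.0, 3.0]]
```

In this toy setup the first channel is doubled and shifted up by one, and the second is halved and shifted down by one, which shows how a single conditioning vector can steer every channel independently.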
