Diff-MST: Differentiable Mixing Style Transfer

Soumya Sai Vanka; Christian Steinmetz; Jean-Baptiste Rolland; Joshua Reiss; George Fazekas

TL;DR

Diff-MST addresses scalable, interpretable, and controllable mixing style transfer for multitrack audio by combining a differentiable mixing console, a transformer-based controller, and an audio production style loss. The controller estimates per-track mixing parameters $P = g_\phi(f_{\theta t}(T), f_{\theta r}(M_r))$ from the input tracks $T$ and the reference mix $M_r$, and the console renders the predicted mix $M_p = h(T, P)$. Tracks and reference are embedded by a spectrogram-based encoder, and the signal chain is differentiable and built from traditional effects, which preserves audio quality. Training relies on two self-supervised regimes and two loss families, audio feature (AF) and multi-resolution STFT (MRSTFT), to capture dynamics, spatialization, and spectral attributes. In objective evaluations the method outperforms the baselines, most clearly when trained with real-world references and the AF loss. The framework scales to any number of input tracks and avoids artifacts because it estimates parameters for traditional effects rather than generating audio directly. It shows practical potential for DAW workflows, though subjective listening tests and richer context modeling remain for future work.
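To make the two-step formulation $P = g(\cdot)$, $M_p = h(T, P)$ concrete, the following is a minimal NumPy sketch under strong simplifying assumptions. The names `mixing_console` and `toy_controller` are hypothetical and do not come from the paper. The console here applies only gain and constant-power panning, and the "controller" is a hand-written rule standing in for the learned transformer, so this illustrates the data flow and shapes only, not the authors' system.

```python
import numpy as np

def mixing_console(tracks, params):
    """Toy stand-in for h(T, P): per-track gain and constant-power pan, then sum.

    tracks: (N, samples) mono tracks.
    params: (N, 2) with columns [gain_db, pan], pan in [0, 1] (0 = left, 1 = right).
    Returns a stereo mix of shape (2, samples).
    """
    gain = 10.0 ** (params[:, 0] / 20.0)      # decibels -> linear gain
    theta = params[:, 1] * np.pi / 2.0        # pan position -> angle in [0, pi/2]
    weights = np.stack([gain * np.cos(theta),  # left-channel weight per track
                        gain * np.sin(theta)]) # right-channel weight per track
    return weights @ tracks                   # (2, N) @ (N, samples)

def toy_controller(tracks, reference):
    """Hypothetical placeholder for g(f(T), f(M_r)); a fixed rule, not a learned model.

    Sets each track's gain so its RMS equals the reference RMS divided by N,
    and centers every pan.
    """
    n = tracks.shape[0]
    eps = 1e-8
    track_rms = np.sqrt(np.mean(tracks ** 2, axis=1)) + eps
    target_rms = np.sqrt(np.mean(reference ** 2)) / n
    gain_db = 20.0 * np.log10(target_rms / track_rms + eps)
    pan = np.full(n, 0.5)
    return np.stack([gain_db, pan], axis=1)   # P has one row per track

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 1000))            # any number of tracks works
M_r = rng.standard_normal((2, 1000))          # stereo reference mix
P = toy_controller(T, M_r)
M_p = mixing_console(T, P)
```

Because `P` is an explicit table of effect settings with one row per track, a user could edit a row and re-render, which is the kind of post-hoc adjustment a parameter-estimation design allows.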

Abstract

Mixing style transfer automates the generation of a multitrack mix for a given set of tracks by inferring production attributes from a reference song. However, existing systems for mixing style transfer are limited in that they often operate only on a fixed number of tracks, introduce artifacts, and produce mixes in an end-to-end fashion, without grounding in traditional audio effects, prohibiting interpretability and controllability. To overcome these challenges, we introduce Diff-MST, a framework comprising a differentiable mixing console, a transformer controller, and an audio production style loss function. By inputting raw tracks and a reference song, our model estimates control parameters for audio effects within a differentiable mixing console, producing high-quality mixes and enabling post-hoc adjustments. Moreover, our architecture supports an arbitrary number of input tracks without source labelling, enabling real-world applications. We evaluate our model's performance against robust baselines and showcase the effectiveness of our approach, architectural design, tailored audio production style loss, and innovative training methodology for the given task.
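As a rough illustration of what a multi-resolution STFT (MRSTFT) comparison between a predicted mix and a reference measures, the sketch below averages the log-magnitude spectrogram difference over several FFT sizes. It is a simplified assumption-laden version: the function names, FFT sizes, hop ratio, and the choice of a plain L1 log-magnitude term are illustrative and are not taken from the paper. It is written in NumPy for readability, so it is not differentiable; training would need an autodiff framework.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a mono signal using a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def mrstft_distance(a, b, fft_sizes=(256, 512, 1024)):
    """Mean absolute log-magnitude difference, averaged over FFT sizes.

    Assumes both signals are mono, equal length, and at least max(fft_sizes) long.
    """
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4                       # 75% overlap (illustrative choice)
        mag_a = stft_mag(a, n_fft, hop)
        mag_b = stft_mag(b, n_fft, hop)
        total += np.mean(np.abs(np.log(mag_a + 1e-7) - np.log(mag_b + 1e-7)))
    return total / len(fft_sizes)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
same = mrstft_distance(x, x)                   # identical signals
quieter = mrstft_distance(x, 0.5 * x)          # same content at half amplitude
```

Using several window lengths trades off time and frequency resolution, so the distance responds to both transient and tonal differences between the two signals.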

Paper Structure (18 sections, 8 equations, 5 figures, 2 tables)

Introduction
Mixing Style Transfer
Method
Problem Formulation
Differentiable Mixing Style Transfer System
Differentiable Mixing Console (DMC)
Spectrogram Encoder
Transformer Controller
Audio Production Style Loss
Experiment Design
Datasets
Training Details
Baselines
Objective Evaluation
Discussion
...and 3 more sections

Figures (5)

Figure 1: Diff-MST, a differentiable mixing style transfer framework featuring a differentiable multitrack mixing console, a transformer-based controller that estimates control parameters for this mixing console, and an audio production style loss function that measures the similarity between the estimated mix and reference mixes.
Figure 2: Formulations for deep learning-based automatic mixing systems [steinmetz2022automix]: (a) direct transformation, (b) parameter estimation with a parameter loss, (c) parameter estimation with an audio loss. Here, $x_i$ for $i \in [1, N]$ are the $N$ input tracks, $f_\theta$ is the transformation, $h$ is the dedicated mixing console, $Y$ and $\hat{Y}$ are the ground-truth and predicted mixes, $P$ and $\hat{P}$ are the ground-truth and predicted control parameters, and $L_a$ and $L_p$ are the audio and parameter losses, respectively.
Figure 3: Differentiable mixing console.
Figure 4: First training strategy from Section \ref{sec:method_1}.
