Diff-MST: Differentiable Mixing Style Transfer

Soumya Sai Vanka, Christian Steinmetz, Jean-Baptiste Rolland, Joshua Reiss, George Fazekas

TL;DR

Diff-MST addresses scalable, interpretable, and controllable mixing style transfer for multitrack audio by combining a differentiable mixing console, a transformer-based controller, and an audio production style loss. Given the raw tracks $T$ and a reference mix $M_r$, the controller estimates per-track mixing parameters $P = g_\phi(f_{\theta_t}(T), f_{\theta_r}(M_r))$, where $f_{\theta_t}$ and $f_{\theta_r}$ are the spectrogram-based track and reference encoders, and the console renders the predicted mix $M_p = h(T, P)$ through a differentiable signal chain that preserves audio quality. Training relies on two self-supervised regimes and two loss families, an audio feature (AF) loss and a multi-resolution STFT (MRSTFT) loss, to capture dynamics, spatialization, and spectral attributes. In objective evaluations the method outperforms the baselines, most clearly when trained with real-world references and the AF loss. The framework scales to any number of input tracks and, because it estimates parameters for traditional audio effects rather than synthesizing audio end to end, avoids introducing artifacts. It shows practical potential for DAW workflows, though subjective listening tests and richer context modeling remain future work.
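To make the rendering step $M_p = h(T, P)$ concrete, the following is a minimal sketch of a mixing console in plain Python. It is not the authors' implementation: the choice of effects (a gain in dB and a constant-power pan per track), the parameter layout, and the names `mix_console` and `db_to_linear` are assumptions made for illustration, since this summary does not list the console's effects. Plain Python has no automatic differentiation, so a trainable version would need the same arithmetic expressed in an autodiff framework.

```python
import math


def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def mix_console(tracks, params):
    """Hypothetical stand-in for h(T, P): render a stereo mix.

    tracks: list of N mono tracks, each a list of floats of equal length.
    params: list of N (gain_db, pan) tuples, pan in [0, 1]
            (0 = hard left, 0.5 = center, 1 = hard right).
    Returns (left, right) sample lists.
    """
    if len(tracks) != len(params):
        raise ValueError("need exactly one parameter tuple per track")
    n_samples = len(tracks[0])
    left = [0.0] * n_samples
    right = [0.0] * n_samples
    for track, (gain_db, pan) in zip(tracks, params):
        gain = db_to_linear(gain_db)
        # Constant-power pan law: cos/sin of an angle in [0, pi/2].
        theta = pan * math.pi / 2.0
        gain_left = gain * math.cos(theta)
        gain_right = gain * math.sin(theta)
        # Every operation is a smooth function of (gain_db, pan), which is
        # what would let gradients flow back to a controller.
        for i, sample in enumerate(track):
            left[i] += gain_left * sample
            right[i] += gain_right * sample
    return left, right
```

Because the console simply loops over the tracks and sums them, nothing in it depends on the number of inputs: calling `mix_console` with two tracks or with twenty only changes the length of `tracks` and `params`. Each entry of `params` also remains a human-readable setting that could be adjusted after the fact.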

Abstract

Mixing style transfer automates the generation of a multitrack mix for a given set of tracks by inferring production attributes from a reference song. However, existing systems for mixing style transfer are limited in that they often operate only on a fixed number of tracks, introduce artifacts, and produce mixes in an end-to-end fashion, without grounding in traditional audio effects, prohibiting interpretability and controllability. To overcome these challenges, we introduce Diff-MST, a framework comprising a differentiable mixing console, a transformer controller, and an audio production style loss function. By inputting raw tracks and a reference song, our model estimates control parameters for audio effects within a differentiable mixing console, producing high-quality mixes and enabling post-hoc adjustments. Moreover, our architecture supports an arbitrary number of input tracks without source labelling, enabling real-world applications. We evaluate our model's performance against robust baselines and showcase the effectiveness of our approach, architectural design, tailored audio production style loss, and innovative training methodology for the given task.
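As a rough illustration of what an audio production style loss could compare, the sketch below computes a few mix-level features for a predicted and a reference stereo mix and sums their squared differences. The feature set (RMS level, crest factor, stereo width, stereo imbalance), the uniform weights, and the names `audio_features` and `feature_loss` are assumptions for illustration rather than the loss defined in the paper. Spectral features are left out to keep the example dependency-free.

```python
import math

EPS = 1e-8  # guards against division by zero on silent signals


def mean_square(x):
    """Average power of a sample list."""
    return sum(v * v for v in x) / len(x)


def audio_features(left, right):
    """Hypothetical mix-level features for dynamics and spatialization."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    p_left, p_right = mean_square(left), mean_square(right)
    p_mid, p_side = mean_square(mid), mean_square(side)
    peak = max(max(abs(v) for v in left), max(abs(v) for v in right))
    level = math.sqrt((p_left + p_right) / 2.0)
    return {
        "rms": level,                                  # overall loudness proxy
        "crest_factor": peak / (level + EPS),          # dynamics
        "stereo_width": p_side / (p_mid + EPS),        # side vs. mid energy
        "stereo_imbalance": (p_right - p_left) / (p_right + p_left + EPS),
    }


def feature_loss(pred, ref, weights=None):
    """Weighted sum of squared feature differences between two stereo mixes.

    pred, ref: (left, right) tuples of sample lists.
    """
    f_pred = audio_features(*pred)
    f_ref = audio_features(*ref)
    weights = weights or {name: 1.0 for name in f_ref}
    return sum(weights[name] * (f_pred[name] - f_ref[name]) ** 2 for name in f_ref)
```

Because the loss compares summary features rather than waveforms, the predicted mix and the reference do not need to contain the same musical content, which is what a style-transfer setting with a reference from a different song requires.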

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Diff-MST, a differentiable mixing style transfer framework featuring a differentiable multitrack mixing console, a transformer-based controller that estimates control parameters for this mixing console, and an audio production style loss function that measures the similarity between the estimated mix and reference mixes.
  • Figure 2: Formulations for deep learning-based automatic mixing systems [steinmetz2022automix]: (a) direct transformation, (b) parameter estimation with a parameter loss, (c) parameter estimation with an audio loss. Here, $x_i$ for $i \in [1, N]$ are the $N$ input tracks, $f_\theta$ is the transformation, $h$ is the dedicated mixing console, $Y$ and $\hat{Y}$ are the ground-truth and predicted mixes, $P$ and $\hat{P}$ are the ground-truth and predicted control parameters, and $L_a$ and $L_p$ are the audio and parameter losses, respectively.
  • Figure 3: Differentiable mixing console.
  • Figure 4: First training strategy from Section \ref{sec:method_1}.
  • Figure :