Automatic Music Mixing using a Generative Model of Effect Embeddings

Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Vesa Välimäki, Yuki Mitsufuji

TL;DR

MEGAMI addresses the subjectivity of automatic music mixing by modeling the conditional distribution $p(\hat{\mathcal{Y}} \,|\, \mathcal{X})$ of processed tracks using a conditional diffusion model in an effect-embedding space. It introduces a Multitrack Effect Embedding Generator that produces per-track latent embeddings conditioned on CLAP-derived content features, with a permutation-equivariant transformer and domain adaptation to enable training on wet-only data. The Effect Processor applies the generated embeddings through a track-agnostic network, guided by multi-scale spectral and cosine losses to preserve effect characteristics. Across internal and public benchmarks, MEGAMI outperforms baselines on distributional objective metrics and approaches human-level quality in subjective tests, demonstrating its potential for scalable, flexible automatic mixing across diverse genres.

Abstract

Music mixing involves combining individual tracks into a cohesive mixture, a subjective task in which multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this multiplicity of solutions. Here we introduce MEGAMI (Multitrack Embedding Generative Auto MIxing), a generative framework that models the conditional distribution of professional mixes given unprocessed tracks. MEGAMI uses a track-agnostic effects processor conditioned on per-track generated embeddings, handles arbitrary unlabeled tracks through a permutation-equivariant architecture, and enables training on both dry and wet recordings via domain adaptation. Our objective evaluation using distributional metrics shows consistent improvements over existing methods, while listening tests indicate performance approaching human-level quality across diverse musical genres.
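
To make the term "permutation-equivariant" concrete, the following is a minimal sketch and not the MEGAMI architecture itself. It assumes a toy single-head self-attention layer with identity projections and no positional encoding, applied to a list of per-track vectors. The names `self_attention`, `permute`, `tracks`, and `order` are hypothetical and chosen only for illustration. The point it shows is that reordering the input tracks reorders the outputs in the same way, so no track labels or fixed track order are needed.

```python
import math
import random


def self_attention(tracks):
    """Toy single-head self-attention over a list of per-track vectors.

    Assumptions (illustrative only): identity query/key/value projections
    and no positional encoding, so the layer has no notion of track order.
    """
    dim = len(tracks[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tracks:
        # Scaled dot-product score of this track against every track.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tracks]
        # Numerically stable softmax over the track axis.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of all track vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, tracks)) / total
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


def permute(items, order):
    """Reorder a list so that position i holds items[order[i]]."""
    return [items[i] for i in order]


random.seed(0)
tracks = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
order = [3, 0, 4, 1, 2]

# Equivariance check: layer(permute(x)) should equal permute(layer(x)).
a = self_attention(permute(tracks, order))
b = permute(self_attention(tracks), order)
max_diff = max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
print(max_diff < 1e-9)
```

The two results agree up to floating-point rounding, because each output depends only on the set of input tracks and on the track it corresponds to, not on where that track sits in the list.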

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Diagram of the proposed MEGAMI system.
  • Figure 2: Boxplots of subjective listening test scores for each song individually and for all songs combined, showing that MEGAMI approaches the quality of a human mixing engineer and, in some cases, exceeds it.