SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky, Ambuj Mehrish, Abhinaba Roy, Dorien Herremans

TL;DR

SonicMaster addresses the problem of restoring and mastering music with a single, controllable model capable of handling multiple artifact types. It introduces a rectified-flow, text-conditioned framework that operates in a latent space learned by a diffusion-based VAE, guided by natural language prompts to perform dereverberation, declipping, EQ, dynamics, and stereo enhancement in one step. A large text-conditioned music-restoration dataset (25k clips, 175k degraded pairs) enables joint learning of 19 degradations across five categories, and a pooling pathway supports long-form generation. Objective metrics and listening tests show SonicMaster outperforms baselines and approaches specialized restoration methods, with strong generalization to historical piano recordings, highlighting its potential as a generalist restoration tool driven by intuitive prompts.

Abstract

Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks, built by simulating common degradation types with nineteen degradation functions belonging to five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over other baselines.
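
The flow-matching idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic rectified-flow formulation (a straight path between a source latent and a target latent, with a constant velocity as the regression target) and assumes that the source endpoint is the degraded latent. Neither assumption is confirmed by this summary, and text conditioning is omitted entirely. All function names and the toy latents are hypothetical, and the "oracle" stands in for a trained network.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from x0 (degraded) to x1 (clean), t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Rectified-flow regression target: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy 4-dimensional "latents" (hypothetical values, not real audio latents).
degraded = [0.9, -0.4, 0.1, 0.7]
clean = [0.5, -0.2, 0.3, 0.2]

def oracle(x_t, t):
    """Stand-in for a trained model: returns the exact target velocity."""
    return target_velocity(degraded, clean)

loss = flow_matching_loss(oracle, degraded, clean, t=random.random())
restored = euler_sample(oracle, degraded, num_steps=10)
```

With a perfect velocity predictor the loss is zero and Euler integration carries the degraded latent to the clean one. In practice the predictor would be a neural network trained over many pairs and random values of `t`.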

Paper Structure

This paper contains 30 sections, 19 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: SonicMaster dataset creation pipeline and overview.
  • Figure 2: Overall architecture of SonicMaster.
  • Figure 3: Comparison of SI-SDR scores ($\uparrow$) for Dynamics and Reverb removal.
  • Figure 4: Listening study: SonicMaster's performance on specific degradations (MOS with 95% CI).
  • Figure 5: Comparative Listening Study Results ($N=20$ participants $\times$ 10 samples per category).
  • ...and 6 more figures