The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

TL;DR

The paper introduces the Cinematic Demixing Track of SDX'23 and its hidden test set CDXDB23 to benchmark dialogue, sound effects, and music separation in real movie audio. It shows that data realism and targeted preprocessing (e.g., vocal-removal, dialogue-first cascades) significantly boost SDR, with up to 5.7 dB gains when extra data are allowed. Key contributions include the CDXDB23 dataset, the dual-leaderboard framework separating data- and algorithm-driven gains, and an analysis of distribution mismatches between synthetic DnR and real cinematic audio, along with strategies to mitigate them. The findings highlight practical pathways to robust cinematic separation, such as data augmentation, model ensembles, and normalization that align synthetic training data with real-world film soundtracks.

Abstract

This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we describe CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, achieved an improvement of 5.7 dB. A major source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.
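
To make the SDR figures quoted above concrete, the following is a minimal sketch of a signal-to-distortion ratio computed over a whole clip. It assumes the common energy-ratio form, $10 \log_{10}\left(\sum \|s\|^2 / \sum \|s - \hat{s}\|^2\right)$, summed over all samples and channels; the exact metric definition is not given in this summary, and the function name `global_sdr` and the `eps` stabilizer are illustrative choices, not the challenge's evaluation code.

```python
import math


def global_sdr(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB over all samples and channels of one clip.

    `reference` and `estimate` are sequences of frames, each frame holding
    one sample per channel. `eps` (an assumption) avoids division by zero
    and log(0) for silent or perfectly reconstructed clips.
    """
    signal_energy = 0.0
    error_energy = 0.0
    for ref_frame, est_frame in zip(reference, estimate):
        for r, e in zip(ref_frame, est_frame):
            signal_energy += r * r
            error_energy += (r - e) ** 2
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy stereo clip: the estimate is the reference scaled by 0.9,
# i.e. a 10% amplitude error on every sample.
reference = [(1.0, -1.0), (0.5, 0.5), (-1.0, 1.0)]
estimate = [(0.9, -0.9), (0.45, 0.45), (-0.9, 0.9)]
sdr_db = global_sdr(reference, estimate)
```

In this toy case the error energy is one hundredth of the reference energy, so the result is about 20 dB. Under the same definition, a gain of 1.8 dB corresponds to roughly a 1.5-fold reduction in error energy at fixed reference energy, and a gain of 5.7 dB to roughly a 3.7-fold reduction.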

Paper Structure (27 sections, 8 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Statistics of movies in CDXDB23.
  • Figure 2: Performance of submissions on full CDXDB23 over time.
  • Figure 3: Analysis of overfitting of global SDR. The $y$-axis shows the difference between global SDR on the hidden test set and global SDR displayed to the participants (trajectories with negative slope indicate overfitting).
  • Figure 4: Comparison of the cocktail-fork baseline with winning submissions on both leaderboards for individual movies. For movie "000", we only have one clip and, hence, the box plot collapses to a horizontal line. Circles represent outliers that are outside the whiskers of the boxplot.
  • Figure 5: Dependence of SDR on the input volume in LUFS for music, dialogue, and effects. Solid lines show SDR values on RED; crosses mark SDR on CDXDB23. Horizontal dashed and dotted lines show SDR for models without converting the volume of the input signal. The MRX model is blue, MRX-C is orange, MRX-C with a Wiener filter is green, and MRX-C with post-processing scaling is red. For MRX-C with scaling on CDXDB23, SDR values are only available for effects (Team mp3d).
  • ...and 4 more figures