Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

TL;DR

DnR v3 advances cinematic audio source separation by embedding multilingual dialogue into the DX stem, removing vocal content from MX and FX, and aligning loudness and mastering with industry practices to better reflect real-world cinematic production. The dataset retains a synthetic framework but expands language coverage (>30 languages) and refines the generation pipeline with normalized loudness, dense non-dialogue backdrops, and Netflix-style mastering. It is released under CC BY-SA 4.0, with replication code under Apache 2.0. Benchmark results using the Bandit model show that multilingual training often matches or surpasses monolingual performance across languages, including low-resource scenarios, demonstrating strong cross-lingual generalization. This work provides a practical, license-cleared resource for advancing multilingual CASS research and highlights the value of cross-language training for robust source separation in cinema-like audio.

Abstract

Cinematic audio source separation (CASS), the problem of extracting the dialogue, music, and effects stems from their mixture, is a relatively new subtask of audio source separation. To date, only one publicly available dataset exists for CASS: the Divide and Remaster (DnR) dataset, currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas of improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicate that training on multilingual data yields significant generalizability, even in languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par with or better than dedicated models trained on monolingual CASS datasets. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.

Paper Structure

This paper contains 20 sections, 1 figure, 4 tables, and 2 algorithms.

Figures (1)

  • Figure 1: Test set distribution of the post-mastering loudness, true peak, event counts, and event durations for each stem. Each colored line represents a variant. The dotted line in the mixture loudness plot indicates the -27 LKFS level. The dotted lines in the peak plots indicate the -2 dBFS and 0 dBFS levels.
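
The two reference levels in the Figure 1 caption can be made concrete with a little decibel arithmetic. The sketch below is illustrative only and is not the authors' mastering pipeline. It assumes the integrated loudness of a mixture has already been measured elsewhere (for example, with an ITU-R BS.1770 meter). It also uses a plain sample peak as a rough stand-in for true peak, which strictly requires oversampling. All function names are hypothetical.

```python
import math

# Reference levels taken from the Figure 1 caption.
TARGET_LKFS = -27.0        # dotted line in the mixture loudness plot
PEAK_CEILING_DBFS = -2.0   # lower dotted line in the peak plots


def gain_db_to_target(measured_lkfs, target_lkfs=TARGET_LKFS):
    """Static gain (dB) that moves a measured loudness onto the target.

    Assumption: a constant gain shifts integrated loudness by the same
    number of dB, which holds to a good approximation.
    """
    return target_lkfs - measured_lkfs


def db_to_linear(gain_db):
    """Convert an amplitude gain in dB to a linear multiplier."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples, gain_db):
    """Scale every sample by the given gain in dB."""
    factor = db_to_linear(gain_db)
    return [s * factor for s in samples]


def sample_peak_dbfs(samples):
    """Sample peak in dBFS (NOT true peak: no oversampling is done)."""
    peak = max(abs(s) for s in samples)
    return -math.inf if peak == 0 else 20.0 * math.log10(peak)


def within_ceiling(samples, ceiling_dbfs=PEAK_CEILING_DBFS):
    """Check whether the sample peak stays at or below the ceiling."""
    return sample_peak_dbfs(samples) <= ceiling_dbfs


if __name__ == "__main__":
    # Hypothetical mixture measured at -21 LKFS with a full-scale peak.
    mixture = [0.5, -1.0, 0.25]
    gain = gain_db_to_target(-21.0)
    mastered = apply_gain(mixture, gain)
    print(gain, sample_peak_dbfs(mastered), within_ceiling(mastered))
```

In this toy case the mixture needs 6 dB of attenuation to reach the -27 LKFS target, which also brings its sample peak below the -2 dBFS ceiling. A real mastering chain would typically add a true-peak limiter rather than rely on static gain alone.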