Source Separation of Small Classical Ensembles: Challenges and Opportunities

Gerardo Roa-Dabike, Trevor J. Cox, Jon P. Barker, Michael A. Akeroyd, Scott Bannister, Bruno Fazenda, Jennifer Firth, Simone Graetzer, Alinka Greasley, Rebecca R. Vos, William M. Whitmer

TL;DR

The paper addresses the challenge of separating instruments in small classical ensembles for hearing-aid remixing, a problem made harder by limited labeled data and complex acoustics. It trains eight ConvTasNet-based models on two synthetic datasets (EnsembleSet for strings and a newly created CadenzaWoodwind for woodwinds) and compares causal and non-causal processing on Bach10 and URMP, revealing a substantial generalization gap to real recordings. Key contributions include the CadenzaWoodwind dataset, a baseline MSS framework for classical ensembles, and an analysis of the gap between synthetic validation performance ($6.2$–$6.9$ dB SDR) and real-world results ($0.12$–$0.58$ dB SDR). The results underscore the difficulty of transferring from synthetic training data to real performances and point to concrete directions in data diversification, realism, and benchmarks for advancing hearing-aid-friendly MSS for classical music.

Abstract

Musical source separation (MSS) of Western popular music using non-causal deep learning can be very effective. In contrast, MSS for classical music is an unsolved problem. Classical ensembles are harder to separate than popular music because of issues such as the inherently greater variation in the music; the scarcity of recordings with ground truth for supervised training; and greater ambiguity between instruments. The Cadenza project has been exploring MSS for classical music so that music can be remixed to improve listening experiences for people with hearing loss. To enable the work, a new database of synthesized woodwind ensembles was created to overcome instrumental imbalances in EnsembleSet. For the MSS, a set of ConvTasNet models was used, with each model trained to extract one string or woodwind instrument. ConvTasNet was chosen because it allows both causal and non-causal approaches to be tested. Non-causal approaches have dominated MSS work and are useful for recorded music, but for live music or processing on hearing aids, causal signal processing is needed. MSS performance was evaluated on two small datasets of real instrument recordings (Bach10 and URMP) for which ground truth is available. The causal and non-causal systems performed similarly. Comparing the average Signal-to-Distortion Ratio (SDR) on the synthesized validation set (6.2 dB causal; 6.9 dB non-causal) with that on the real recorded evaluation set (0.3 dB causal; 0.4 dB non-causal) shows that the mismatch between synthesized and recorded data is a problem. Future work needs to either gather more real recordings that can be used for training, or to improve the realism and diversity of the synthesized recordings to reduce the mismatch...
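
To give a feel for the SDR figures quoted in the abstract, the following is a minimal sketch of the plain energy-ratio definition, SDR = 10 log10(||s||^2 / ||s - s_hat||^2). The abstract does not say which SDR variant the authors report, so this definition is an assumption. The function names `sdr_db` and `degrade_to_sdr` and the 440 Hz test tone are hypothetical illustrations, not the paper's code or data.

```python
import math
import random


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-Distortion Ratio in dB (plain energy-ratio form, an assumption)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps guards against log(0) for a silent reference or a perfect estimate.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def degrade_to_sdr(reference, target_db, seed=0):
    """Add Gaussian noise scaled so the result sits at target_db SDR (toy helper)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in reference]
    signal_energy = sum(s * s for s in reference)
    noise_energy = sum(n * n for n in noise)
    # Choose the gain so that error energy = signal energy / 10^(target_db / 10).
    gain = math.sqrt(signal_energy / (noise_energy * 10.0 ** (target_db / 10.0)))
    return [s + gain * n for s, n in zip(reference, noise)]


# Toy "ground truth": one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
reference = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
             for n in range(sample_rate)]

# Two toy "estimates", built to sit at the causal averages quoted in the abstract.
validation_like = degrade_to_sdr(reference, 6.2)
evaluation_like = degrade_to_sdr(reference, 0.3)

print(f"validation-like SDR: {sdr_db(reference, validation_like):.1f} dB")
print(f"evaluation-like SDR: {sdr_db(reference, evaluation_like):.1f} dB")
```

Under this definition, an SDR near 0 dB means the error carries roughly as much energy as the target source itself, whereas about 6 dB corresponds to an error with roughly a quarter of the target's energy. That is one way to read the gap between the synthesized validation set and the real recordings.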

Paper Structure

This paper contains 9 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: First 10 seconds of a quartet from Bach10 and URMP datasets.
  • Figure 2: Block diagram of the TasNet architecture.
  • Figure 3: Ground Truth and estimation of cello from one validation sample.
  • Figure 4: Ground Truth and estimation of bassoon from one validation sample.