Source Separation for A Cappella Music

Luca A. Lanzendörfer, Constantin Pinkl, Florian Grötschla

TL;DR

The paper tackles multi-singer source separation in a cappella music when the number of active singers varies, introducing SepACap, a waveform-domain model derived from SepReformer that uses periodic activations and a silence-aware composite loss. A power-set data augmentation strategy generates $2^n-1$ mixtures per clip, enabling joint separation and detection of active singers and improving generalization to missing singers. On the JaCappella dataset, SepACap achieves state-of-the-art results for full ensembles and exhibits robust subset performance with reduced bleed-through, albeit with some artifact trade-offs. These contributions advance practical applications in transcription, remixing, and analysis of diverse a cappella performances while reducing the data requirements for training multi-singer separation models.

Abstract

In this work, we study the task of multi-singer separation in a cappella music, where the number of active singers varies across mixtures. To address this, we use a power set-based data augmentation strategy that expands limited multi-singer datasets into exponentially more training samples. To separate singers, we introduce SepACap, an adaptation of SepReformer, a state-of-the-art speaker separation model architecture. We adapt the model with periodic activations and a composite loss function that remains effective when stems are silent, enabling robust detection and separation. Experiments on the JaCappella dataset demonstrate that our approach achieves state-of-the-art performance in both full-ensemble and subset singer separation scenarios, outperforming spectrogram-based baselines while generalizing to realistic mixtures with varying numbers of singers.
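The power set-based augmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `power_set_mixtures`, the `include_empty` flag, and the toy three-sample stems are assumptions made for illustration, and stems are represented as plain lists of floats rather than audio arrays. With `include_empty=False`, the sketch yields the $2^n-1$ non-empty mixtures mentioned in the TL;DR.

```python
from itertools import combinations


def power_set_mixtures(stems, include_empty=False):
    """Build one mixture per subset of stems by summing the selected stems.

    stems: dict mapping a stem name to a list of float samples
           (all lists are assumed to have the same length).
    Returns a dict mapping a tuple of stem names to the summed mixture.
    """
    names = sorted(stems)
    n_samples = len(next(iter(stems.values())))
    smallest = 0 if include_empty else 1
    mixtures = {}
    for k in range(smallest, len(names) + 1):
        for subset in combinations(names, k):
            mix = [0.0] * n_samples
            for name in subset:
                for i, sample in enumerate(stems[name]):
                    mix[i] += sample
            mixtures[subset] = mix
    return mixtures


# Hypothetical toy stems; real stems would be full-length waveforms.
stems = {
    "soprano": [0.1, 0.2, 0.3],
    "alto": [0.0, -0.1, 0.1],
    "tenor": [0.05, 0.05, 0.05],
}
mixtures = power_set_mixtures(stems)
print(len(mixtures))  # 7 non-empty subsets for n = 3
```

Each key of `mixtures` records which stems are active, so the same structure also provides presence labels for training a model to detect silent stems.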

Paper Structure

This paper contains 4 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Power-set based data augmentation. For each clip, we construct mixtures by selecting subsets of individual stems (e.g., soprano, alto, tenor) and summing the corresponding stems. This procedure generates all $2^n$ possible combinations of the $n$ available stems, yielding a diverse set of mixtures for training and evaluation.
  • Figure 2: We test the ability of models to predict whether a stem is present in a mix. We report F1 scores for stem detection on the test split of the augmented JaCappella dataset. The results show per-stem detection performance for DPTNet, Mel-Band RoFormer, and SepACap; scores lie in the interval $[0, 1]$, and higher is better. Our proposed model SepACap achieves the best overall performance in stem detection.
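To make the stem-detection metric in Figure 2 concrete, the following sketch computes an F1 score from binary presence labels. The source does not state how a stem is declared present, so the energy-threshold rule in `is_present` and its `threshold` value are assumptions, as are the function names and the toy labels.

```python
def is_present(samples, threshold=1e-4):
    """Assumed detection rule: a stem counts as present if its mean energy exceeds a threshold."""
    energy = sum(s * s for s in samples) / len(samples)
    return energy > threshold


def f1_score(true_present, pred_present):
    """F1 score for binary presence labels (True = stem is active in the mix)."""
    pairs = list(zip(true_present, pred_present))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0  # convention used here when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one stem across five test mixtures.
true_present = [True, True, False, True, False]
pred_present = [True, False, False, True, True]
score = f1_score(true_present, pred_present)
print(round(score, 3))  # 0.667
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3.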