Continual Learning for Singing Voice Separation with Human in the Loop Adaptation
Ankur Gupta, Anshul Rai, Archit Bansal, Vipul Arora
TL;DR
This work addresses singing voice separation under real-world data shifts by introducing a human-in-the-loop (HITL) continual learning framework. A base U-Net model is fine-tuned interactively using user-marked false-positive vocal regions, with two adaptation strategies: zero-vocal-targets and synthetic-tracks. Replay-based fine-tuning mitigates forgetting. Experiments on MUSDB-18 and CC-Mixter show that synthetic-track adaptation generally outperforms zero-vocal-targets, and that mean SDR is the more suitable evaluation metric in HITL settings. The approach is model-agnostic and practical for deployment; future directions include extending it to other stems and incorporating regularization-based continual learning.
Abstract
Deep learning-based methods for singing voice separation have performed exceptionally well in recent years. However, most of them do not let users interact with the model to improve its performance. Such interaction can be crucial when deploying the model in real-world scenarios, where music tracks can differ from the original training data in both genre and instrumentation. In this paper, we present a deep learning-based interactive continual learning framework for singing voice separation that allows users to fine-tune the vocal separation model to adapt it to new target songs. We use a U-Net-based base model that produces a mask for separating vocals from the spectrogram, followed by a human-in-the-loop task in which the user provides feedback by marking a few false positives, i.e., regions in the extracted vocals that should have been silence. We propose two continual learning algorithms. Experiments substantiate the improvement in singing voice separation performance of the proposed algorithms over the base model in both intra-dataset and inter-dataset settings.
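As a rough illustration of the mask-based separation and the false-positive feedback described above, the following is a minimal sketch and not the authors' implementation. It assumes magnitude spectrograms stored as NumPy arrays of shape (frequency bins, frames) and assumes the user's feedback arrives as (start, end) frame ranges. The function names `apply_mask` and `zero_vocal_targets`, the feedback format, and the random arrays standing in for a real mixture and a real U-Net output are all hypothetical.

```python
import numpy as np


def apply_mask(mix_mag, mask):
    """Estimate the vocal magnitude spectrogram by element-wise masking."""
    return mix_mag * mask


def zero_vocal_targets(mix_mag, marked_frames):
    """Build (input, target) pairs from user-marked false-positive regions.

    marked_frames is a list of (start, end) frame indices, end exclusive.
    This feedback format is an assumption made for illustration.
    Each marked region should have been silent, so its vocal target is all zeros.
    """
    pairs = []
    for start, end in marked_frames:
        segment = mix_mag[:, start:end]
        pairs.append((segment, np.zeros_like(segment)))
    return pairs


rng = np.random.default_rng(0)
mix_mag = rng.random((4, 10))  # toy mixture: 4 frequency bins x 10 frames
mask = rng.random((4, 10))     # stand-in for a U-Net mask with values in [0, 1)

vocals = apply_mask(mix_mag, mask)

# Hypothetical feedback: frames 2-4 and 7-8 contained no singing.
pairs = zero_vocal_targets(mix_mag, [(2, 5), (7, 9)])
```

Pairs built this way could then be used as fine-tuning examples for the base model. How they are combined with replayed training data to limit forgetting is not specified in this section and is left out of the sketch.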
