Continual Learning for Singing Voice Separation with Human in the Loop Adaptation

Ankur Gupta, Anshul Rai, Archit Bansal, Vipul Arora

TL;DR

This work addresses singing voice separation under real-world data shifts by introducing a human-in-the-loop (HITL) continual learning framework. A base U-Net model is fine-tuned interactively using user-marked false-positive vocal regions, with two adaptation strategies: zero-vocal-targets and synthetic-tracks. Replay-based fine-tuning mitigates forgetting, and experiments on MUSDB-18 and CC-Mixter show that synthetic-track adaptation generally outperforms zero-targets, with mean SDR serving as a superior evaluation metric in HITL settings. The approach is model-agnostic and practical for deployment, with future directions including extending to other stems and incorporating regularization-based continual learning.
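To make the SDR terminology concrete, the following is a minimal sketch of a frame-wise signal-to-distortion ratio and its mean and median summaries. It uses the plain energy-ratio definition, 10 log10(||s||^2 / ||s - ŝ||^2), which is an assumption here; the evaluation toolkit and exact SDR variant used in the paper are not stated in this section. The function names and the `frame_len` parameter are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    """Energy-ratio SDR in dB (assumed definition, not necessarily the paper's)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent or perfectly estimated frames
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

def framewise_sdr(reference, estimate, frame_len):
    """SDR for each non-overlapping frame of frame_len samples (hypothetical helper)."""
    n_frames = len(reference) // frame_len
    return np.array([
        sdr_db(reference[i * frame_len:(i + 1) * frame_len],
               estimate[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocals = rng.standard_normal(8000)
    estimate = vocals + 0.1 * rng.standard_normal(8000)
    per_frame = framewise_sdr(vocals, estimate, frame_len=1000)
    print("mean SDR:", per_frame.mean(), "median SDR:", np.median(per_frame))
```

The mean reacts to every frame, including a handful of frames that change sharply, while the median ignores changes that leave the middle of the distribution untouched. That difference may matter when only a few user-marked regions are corrected.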

Abstract

Deep learning-based works for singing voice separation have performed exceptionally well in the recent past. However, most of these works do not focus on allowing users to interact with the model to improve performance. This can be crucial when deploying the model in real-world scenarios where music tracks can vary from the original training data in both genre and instruments. In this paper, we present a deep learning-based interactive continual learning framework for singing voice separation that allows users to fine-tune the vocal separation model to conform it to new target songs. We use a U-Net-based base model architecture that produces a mask for separating vocals from the spectrogram, followed by a human-in-the-loop task where the user provides feedback by marking a few false positives, i.e., regions in the extracted vocals that should have been silence. We propose two continual learning algorithms. Experiments substantiate the improvement in singing voice separation performance by the proposed algorithms over the base model in intra-dataset and inter-dataset settings.
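The mask-then-feedback idea in the abstract can be sketched as follows. This is only an illustration under stated assumptions: spectrograms are magnitude arrays of shape (frequency bins, time frames), the mask is a soft mask in [0, 1], and user feedback arrives as (start, end) frame-index pairs. The function names are hypothetical, and the way the paper actually builds its fine-tuning targets is not specified in this section.

```python
import numpy as np

def apply_mask(mix_mag, mask):
    """Estimate the vocal magnitude by element-wise masking of the mixture."""
    return mix_mag * mask

def zero_vocal_target(est_vocal_mag, marked_frames):
    """Build a fine-tuning target that is silent in user-marked regions.

    marked_frames: iterable of (start, end) frame indices, end exclusive
    (assumed feedback format). Frames outside the marks keep the current estimate.
    """
    target = est_vocal_mag.copy()
    for start, end in marked_frames:
        target[:, start:end] = 0.0
    return target

if __name__ == "__main__":
    mix_mag = np.full((3, 5), 2.0)   # toy mixture: 3 bins, 5 frames
    mask = np.full((3, 5), 0.5)      # toy stand-in for a model-predicted mask
    est_vocals = apply_mask(mix_mag, mask)
    target = zero_vocal_target(est_vocals, [(1, 3)])
    print(target)
```

A model fine-tuned toward such a target would be pushed to output silence exactly where the user flagged false-positive vocals, while the unmarked frames leave its current behavior as the reference.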

Paper Structure

This paper contains 13 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Pictorial view of problem statement
  • Figure 2: Pipeline for creating synthetic music tracks
  • Figure 3: Replay based continual learning
  • Figure 4: Comparison of model performance (Mean & Median SDR) with 10%, 50% and 100% of train data as exemplar
  • Figure 5: Comparison of frame wise SDR of songs before and after HITL
  • ...and 2 more figures