Improving Source Extraction with Diffusion and Consistency Models

Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang

TL;DR

This paper proposes a hybrid framework for time-domain musical source extraction that couples a deterministic U‑Net extractor with a score-based diffusion model conditioned on the deterministic model's features and on instrument labels. To overcome diffusion's sampling latency, Consistency Distillation (CD) is employed to enable fast single- or few-step generation, with the consistency student sometimes outperforming its teacher without GAN losses. Experiments on Slakh2100 (bass, drums, guitar, piano) demonstrate state-of-the-art SI-SDR improvements, with the 4-step CD setup delivering substantial gains over deterministic baselines and traditional diffusion methods. The paper also analyzes speed-accuracy trade-offs, showing that CD offers favorable efficiency relative to diffusion while maintaining high audio quality. Overall, the work introduces the first consistency model for audio waveforms, achieving superior source extraction performance and enabling faster, near-real-time processing for music applications.

Abstract

In this work, we demonstrate the integration of a score-matching diffusion model into a deterministic architecture for time-domain musical source extraction, resulting in enhanced audio quality. To address the typically slow iterative sampling process of diffusion models, we apply consistency distillation and reduce the sampling process to a single step, achieving performance comparable to that of diffusion models, and with two or more steps, even surpassing them. Trained on the Slakh2100 dataset for four instruments (bass, drums, guitar, and piano), our model shows significant improvements across objective metrics compared to baseline methods. Sound examples are available at https://consistency-separation.github.io/.
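
The single-step versus multi-step sampling mentioned in the abstract can be illustrated with the generic multistep consistency sampling loop: denoise once from the highest noise level, then alternately re-noise and denoise at smaller levels. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `make_toy_consistency_fn` and `multistep_consistency_sample`, the choice `eps = 0.002`, the idea of starting from a noised initial estimate, and the toy consistency function (exact only when the data distribution is a single known signal) are all hypothetical stand-ins for a trained network and the paper's actual settings.

```python
import math
import random


def make_toy_consistency_fn(target, eps=0.002):
    """Hypothetical stand-in for a trained consistency model.

    For a data distribution concentrated on one signal `target`, a noisy
    sample is x = target + sigma * z, and mapping it back to noise level
    eps gives target + (eps / sigma) * (x - target).
    """
    def f(x, sigma):
        return [t + (eps / sigma) * (v - t) for v, t in zip(x, target)]
    return f


def multistep_consistency_sample(f, x_init, sigmas, eps=0.002, rng=None):
    """Run len(sigmas) consistency steps over a decreasing noise schedule.

    f      : callable (x, sigma) -> estimate of the clean signal
    x_init : initial estimate (assumed here to be a rough extraction)
    sigmas : decreasing noise levels; sigmas[0] plays the role of sigma_max
    """
    rng = rng or random.Random(0)
    # Step 1: add noise at sigma_max, then denoise in a single evaluation.
    x = [v + sigmas[0] * rng.gauss(0.0, 1.0) for v in x_init]
    x = f(x, sigmas[0])
    # Further steps: re-noise to a smaller level, then denoise again.
    for sigma in sigmas[1:]:
        scale = math.sqrt(max(sigma ** 2 - eps ** 2, 0.0))
        x_noisy = [v + scale * rng.gauss(0.0, 1.0) for v in x]
        x = f(x_noisy, sigma)
    return x


if __name__ == "__main__":
    target = [0.5, -0.25, 0.0, 1.0]
    f = make_toy_consistency_fn(target)
    rough = [0.4, -0.2, 0.1, 0.9]
    print(multistep_consistency_sample(f, rough, sigmas=[1.0]))            # 1 step
    print(multistep_consistency_sample(f, rough, sigmas=[1.0, 0.5, 0.1]))  # 3 steps
```

Each entry in `sigmas` costs one model evaluation, so the length of the schedule is the knob that trades speed against quality.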

Paper Structure

This paper contains 23 sections, 18 equations, 2 figures, 3 tables, 2 algorithms.

Figures (2)

  • Figure 1: Diagram illustrating our proposed method. (a) First, we train a mixture-conditional deterministic source extraction model. (b) Next, we introduce a denoising score-matching diffusion model, conditioned on both the features extracted by the deterministic model and the instrument label, which further enhances the extracted audio quality through noise addition and removal.
  • Figure 2: SI-SDRi Avg. vs Log($\sigma_{\text{max}}$) for CD and Diffusion Models across 5 Steps. Each subplot compares the performance of the diffusion model (red squares) and the consistency distillation model (blue circles) for a different number of denoising steps, with a gray dashed line representing the performance of the deterministic model. The x-axis shows $\sigma_{\text{max}}$, the starting noise level for the models, on a logarithmic scale.
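
The SI-SDRi metric plotted in Figure 2 can be computed from the standard scale-invariant SDR definition: project the estimate onto the reference, treat the remainder as error, and report the gain over using the unprocessed mixture as the estimate. The following is a minimal dependency-free sketch, not the paper's evaluation code. The function names and the small stabilizing constant `eps` are assumptions made for illustration, and a real pipeline would more likely use NumPy or PyTorch arrays.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    error = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    error_energy = sum(n * n for n in error)
    return 10.0 * math.log10((target_energy + eps) / (error_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, 0.0, 1.0, 0.0]       # toy target source
    interference = [0.0, 1.0, 0.0, -1.0]   # toy competing source
    mixture = [r + i for r, i in zip(reference, interference)]
    # Toy "extraction" that attenuates the interference by a factor of 10.
    estimate = [r + 0.1 * i for r, i in zip(reference, interference)]
    print(si_sdr_improvement(estimate, mixture, reference))
```

In this toy case the interference is orthogonal to the reference, so attenuating it tenfold yields an improvement of about 20 dB.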