Improving Source Extraction with Diffusion and Consistency Models
Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang
TL;DR
This paper proposes a hybrid framework for time-domain musical source extraction that couples a deterministic U-Net extractor with a score-based diffusion model conditioned on the deterministic model's features and on instrument labels. To overcome the sampling latency of diffusion, the authors apply Consistency Distillation (CD), which yields single- or few-step generation; the distilled student sometimes outperforms its diffusion teacher, without relying on GAN losses. Experiments on Slakh2100 (bass, drums, guitar, piano) show state-of-the-art SI-SDR improvements, with the 4-step CD setup delivering substantial gains over both deterministic baselines and conventional diffusion sampling. The paper also analyzes the speed-accuracy trade-off, showing that CD is more efficient than diffusion sampling while maintaining high audio quality. Overall, the work introduces the first consistency model for audio waveforms, achieving superior source extraction performance and enabling faster processing that approaches real-time use in music applications.
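Since the results are reported as SI-SDR improvements, a minimal sketch of how that metric is commonly computed may help. This uses the standard scale-invariant SDR definition; the function names, the `zero_mean` option, and the `eps` stabilizer are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def si_sdr(estimate, target, zero_mean=True, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if zero_mean:
        # Common convention (an assumption here): remove the DC offset first.
        estimate = estimate - estimate.mean()
        target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    distortion = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_improvement(estimate, target, mixture):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

# Synthetic example: the "estimate" contains less interference than the mixture.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = target + interference
estimate = target + 0.1 * interference
improvement_db = si_sdr_improvement(estimate, target, mixture)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.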
Abstract
In this work, we demonstrate the integration of a score-matching diffusion model into a deterministic architecture for time-domain musical source extraction, resulting in enhanced audio quality. To address the typically slow iterative sampling process of diffusion models, we apply consistency distillation and reduce the sampling process to a single step, achieving performance comparable to that of diffusion models, and with two or more steps, even surpassing them. Trained on the Slakh2100 dataset for four instruments (bass, drums, guitar, and piano), our model shows significant improvements across objective metrics compared to baseline methods. Sound examples are available at https://consistency-separation.github.io/.
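The single-step versus multi-step sampling described above can be sketched as follows. This is a generic multistep consistency sampling loop as found in the consistency models literature, not the authors' implementation: the noise levels, the `SIGMA_MIN` value, and `toy_consistency_fn` (a hand-written stand-in for a trained, conditioned network) are all hypothetical.

```python
import numpy as np

SIGMA_MIN = 0.002  # assumed smallest noise level (illustrative value)

def toy_consistency_fn(x, sigma, cond):
    """Hypothetical stand-in for a trained consistency model f(x, sigma | cond).

    It satisfies the boundary condition f(x, SIGMA_MIN) = x and otherwise
    pulls the noisy input toward the conditioning signal.
    """
    c_skip = (SIGMA_MIN / sigma) ** 2
    return c_skip * x + (1.0 - c_skip) * cond

def multistep_consistency_sample(consistency_fn, cond, sigmas, rng):
    """Generate a sample using len(sigmas) evaluations of consistency_fn.

    sigmas: decreasing noise levels; a single entry gives one-step sampling.
    """
    # Start from pure noise at the largest noise level and denoise in one jump.
    x = rng.standard_normal(cond.shape) * sigmas[0]
    x = consistency_fn(x, sigmas[0], cond)
    for sigma in sigmas[1:]:
        # Re-noise the current estimate to a lower level, then denoise again.
        z = rng.standard_normal(cond.shape)
        x_noisy = x + np.sqrt(sigma**2 - SIGMA_MIN**2) * z
        x = consistency_fn(x_noisy, sigma, cond)
    return x

rng = np.random.default_rng(0)
cond = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))  # toy conditioning signal
one_step = multistep_consistency_sample(toy_consistency_fn, cond, [80.0], rng)
four_step = multistep_consistency_sample(
    toy_consistency_fn, cond, [80.0, 20.0, 5.0, 1.0], rng
)
```

Each additional entry in `sigmas` costs one more network evaluation, which is the speed-accuracy trade-off at issue: with a trained model, the extra re-noise and denoise rounds can refine the one-step output.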
