Investigating training objective for flow matching-based speech enhancement
Liusha Yang, Ziru Ge, Gui Zhang, Junan Zhang, Zhizheng Wu
TL;DR
This work systematically evaluates training objectives for flow matching in speech enhancement, comparing velocity prediction, ${\bf x}_1$ prediction, and preconditioned ${\bf x}_1$ prediction. It demonstrates that preconditioned ${\bf x}_1$ prediction (EDM variant) matches the performance of velocity prediction while accelerating convergence, and shows that adding differentiable PESQ and SI-SDR losses yields faster optimization and improved perceptual quality. The study further shows that, among generative baselines, flow matching with preconditioned ${\bf x}_1$ prediction and joint PESQ+SI-SDR losses offers the best balance across PESQ, ESTOI, SI-SDR, DNSMOS, and WER, with efficient sampling. These findings advance practical, efficient FM-based SE and highlight effective combinations of objective functions for robust speech enhancement.
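For readers unfamiliar with the signal-based term, the following is a minimal sketch of the standard SI-SDR definition and its use as a loss (negated, so that lower is better). It is written in NumPy for illustration only; the function names `si_sdr` and `si_sdr_loss`, the mean removal, and the `eps` stabilizer are assumptions of this sketch, not details taken from the paper. For training, the same arithmetic would be expressed in an automatic-differentiation framework.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition)."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the scaled reference counts as distortion.
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, reference)

# Toy signals standing in for clean speech and an enhanced estimate.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = clean + 0.1 * noise
heavy = clean + 1.0 * noise
print(si_sdr(mild, clean), si_sdr(heavy, clean))
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.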
Abstract
Speech enhancement (SE) aims to recover clean speech from noisy recordings. Although generative approaches such as score matching and the Schrödinger bridge have shown strong effectiveness, they are often computationally expensive. Flow matching offers a more efficient alternative by directly learning a velocity field that maps noise to data. In this work, we present a systematic study of flow matching for SE under three training objectives: velocity prediction, ${\bf x}_1$ prediction, and preconditioned ${\bf x}_1$ prediction. We analyze their impact on training dynamics and overall performance. Moreover, by introducing perceptual (PESQ) and signal-based (SI-SDR) objectives, we further improve convergence efficiency and speech quality, yielding substantial improvements across evaluation metrics.
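To make the difference between the velocity and ${\bf x}_1$ targets concrete, the sketch below assumes a straight-line probability path ${\bf x}_t = (1-t)\,{\bf x}_0 + t\,{\bf x}_1$, a common choice in flow matching; the path actually used in the paper, and the coefficients of the preconditioned variant, are not given in this abstract and are not reproduced here. All function names are hypothetical. Under this assumed path the two targets carry the same information, since a velocity can be recovered from an ${\bf x}_1$ estimate as $({\bf \hat x}_1 - {\bf x}_t)/(1-t)$.

```python
import numpy as np

def linear_path(x0, x1, t):
    """Point on the assumed straight line from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target for velocity prediction under the linear path."""
    return x1 - x0

def velocity_from_x1(x_t, x1_hat, t):
    """Convert an x1 estimate into a velocity (valid for t < 1)."""
    return (x1_hat - x_t) / (1.0 - t)

def mse(prediction, target):
    """Mean squared error, the usual regression loss for either target."""
    return float(np.mean((prediction - target) ** 2))

# Toy vectors standing in for a prior sample (x0) and clean speech (x1).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)
x1 = rng.standard_normal(16000)
t = 0.3
x_t = linear_path(x0, x1, t)

# A perfect x1 prediction reproduces the velocity target exactly.
v_true = velocity_target(x0, x1)
v_from_x1 = velocity_from_x1(x_t, x1, t)
print(mse(v_from_x1, v_true))
```

The division by $1-t$ grows without bound as $t \to 1$, so errors in an ${\bf x}_1$ estimate are amplified near the end of the path when converted to a velocity; this sensitivity is one reason the choice of prediction target can affect training dynamics.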
