Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching

Junwon Moon; Hyunjin Choi; Hansol Park; Heeseung Kim; Kyuhong Shim

Abstract

Target speaker extraction (TSE) isolates a target speaker's voice from an overlapping speech mixture, given a reference utterance from that speaker. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking, which makes inference fast but often over-suppresses the target signal. Generative methods synthesize high-quality speech but require numerous iterative sampling steps. We propose Mask2Flow-TSE, a two-stage framework that combines the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the masked output toward the target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE, with approximately 85M parameters, achieves performance comparable to existing generative TSE methods.
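The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the straight-line flow toward the target, and the oracle `toy_velocity` (which reads the clean target directly instead of predicting it with a trained, reference-conditioned network) are all assumptions made for illustration. It only shows the control flow: mask the mixture first, then take one Euler step of the flow starting from the masked spectrogram rather than from Gaussian noise.

```python
import numpy as np

def stage1_mask(mixture_mag, mask):
    """Stage 1 (assumed form): element-wise time-frequency masking."""
    return mask * mixture_mag

def stage2_one_step(x_start, velocity_fn, cond):
    """Stage 2 (assumed form): one Euler step of the flow ODE from t=0 to t=1."""
    t0, dt = 0.0, 1.0
    return x_start + dt * velocity_fn(x_start, t0, cond)

def toy_velocity(x, t, cond):
    # Hypothetical oracle standing in for a trained network: it points
    # straight from the current state to the known clean target.
    return cond["target"] - x

rng = np.random.default_rng(0)
target = rng.random((4, 6))        # toy clean magnitude spectrogram (freq x time)
interferer = rng.random((4, 6))    # toy interfering speaker
mixture = target + interferer

# A deliberately over-suppressing mask (scaled by 0.8) to mimic the
# over-suppression that the abstract attributes to masking methods.
mask = 0.8 * np.clip(target / (mixture + 1e-8), 0.0, 1.0)

coarse = stage1_mask(mixture, mask)                                   # stage 1 output
refined = stage2_one_step(coarse, toy_velocity, {"target": target})   # stage 2 output

err_coarse = np.abs(coarse - target).mean()
err_refined = np.abs(refined - target).mean()
```

With the oracle velocity, the single step recovers the target exactly, so `err_refined` is far smaller than `err_coarse`. A learned velocity field would only approximate this, and how closely it does so is what the paper's experiments evaluate.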

Paper Structure (32 sections, 16 equations, 6 figures, 4 tables)

Introduction
Related Work
Masking-based Models for TSE
Generative Models for TSE
Efficient SE and TSE
Preliminaries
Target Speaker Extraction
Flow Matching
Mask2Flow-TSE
Motivation: Why Two Stages?
Two-Stage Framework
Stage 1: Masking
Stage 2: Flow Matching
Experimental Setup
Datasets
...and 17 more sections

Figures (6)

Figure 1: Single-stage generative TSE vs. our two-stage Mask2Flow-TSE. Conventional approaches start from Gaussian noise requiring many iterative steps, while our method starts from the masking-enhanced spectrogram, reducing inference to only 1 step.
Figure 2: Cumulative delete--insert (D/I) proportion of a flow-only TSE model across 8 Euler steps. Each bar decomposes the total energy change from the input mixture into Delete (D, red) and Insert (I, blue). Mask: a separately trained discriminative masking model. Target: ground-truth clean speech. (a) Libri2Mix Noisy; (b) Libri2Mix Clean.
Figure 3: The proposed Mask2Flow-TSE architecture with masking and flow matching stages.
Figure 4: Comparison of WER and total system size under speech additive noise. Each point represents a (Whisper backbone, TSE method) pair. Our approach achieves the same WER as Whisper large-v2 alone with $\sim$10$\times$ fewer parameters.
Figure 5: Spectrogram comparison on Libri2Mix test samples. From top to bottom: (a) input mixture, (b) masking stage output, (c) Mask2Flow-TSE output, and (d) clean target.
...and 1 more figure
