Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching

Zhengyang Chen; Bing Han; Shuai Wang; Yidi Jiang; Yanmin Qian

TL;DR

This paper implements a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target-speaker voice activity detection (Seq2Seq-TSVAD) diarization system. It proposes mapping the binary label sequence into a dense latent space before applying the generative algorithm, and the resulting system can significantly outperform the traditional Seq2Seq-TSVAD system.

Abstract

Speaker diarization is typically treated as a discriminative task, with discriminative approaches producing a single, fixed diarization result. In this paper, we explore the use of neural network-based generative methods for speaker diarization for the first time. We implement a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target-speaker voice activity detection (Seq2Seq-TSVAD) diarization system. Our experiments reveal that applying the generative method directly to the original binary label sequence space of the TS-VAD output is ineffective. To address this issue, we propose mapping the binary label sequence into a dense latent space before applying the generative algorithm. The resulting Flow-TSVAD method outperforms the Seq2Seq-TSVAD system. Additionally, we observe that the FM algorithm converges rapidly during the inference stage, requiring only two inference steps to achieve promising results. As a generative model, Flow-TSVAD can produce different diarization results by running the model multiple times, and ensembling the results of several sampling runs further improves diarization performance.
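
The few-step sampling and ensembling ideas described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a fixed-step Euler integrator for the flow ODE, a hypothetical closed-form velocity field (`toy_velocity`) standing in for the trained network, a hypothetical threshold-at-zero decoder in place of the VAD label auto-encoder, and per-frame majority voting as one possible ensembling rule. None of these details are confirmed by the text above.

```python
import numpy as np


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # times visited: 0, h, ..., 1 - h
        x = x + h * velocity_fn(x, t)
    return x


def majority_vote(binary_samples):
    """Ensemble equal-shape 0/1 arrays per frame; ties resolve to 0."""
    stacked = np.stack(binary_samples).astype(float)
    return (stacked.mean(axis=0) > 0.5).astype(int)


# Hypothetical latent code for the frame labels 0, 1, 1, 0 (illustration only).
target_latent = np.array([[-1.0, 1.0, 1.0, -1.0]])


def toy_velocity(x, t):
    """Stand-in for a trained velocity network: points straight at the target.

    Along the linear path x_t = (1 - t) * x0 + t * x1, the velocity x1 - x0
    can be rewritten as (x1 - x_t) / (1 - t), which is what is used here.
    """
    return (target_latent - x) / (1.0 - t)


rng = np.random.default_rng(0)
samples = []
for _ in range(3):
    noise = rng.standard_normal(target_latent.shape)   # start from Gaussian noise
    latent = euler_sample(toy_velocity, noise, num_steps=2)
    samples.append((latent > 0).astype(int))            # hypothetical decoder

ensembled = majority_vote(samples)
```

Because the toy velocity field always points exactly at the target, every run here decodes to the same labels. With a trained model the runs would generally differ, which is the situation where combining them is useful.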

Paper Structure (16 sections, 4 equations, 3 figures, 3 tables)

Introduction
Method
Flow Matching Algorithm
VAD Label Auto-Encoder
Generative TSVAD based on Sequence-to-Sequence Modeling
Experiment Setup
Dataset
Model Configuration
System Training
Evaluation Setup
Results and Analysis
Analysis for the Impact of Different Latent Dimensions
Analysis for Different Inference Steps
Analysis of the Randomness from Different Sampling
Performance Comparison between Different Systems
...and 1 more section

Figures (3)

Figure 1: Overview of the Flow-TSVAD system.
Figure 2: DER (%) for different numbers of inference steps. The results in the figure are inferred with 1, 2, 3, 4, 5, 6, 7, 8, 16, and 32 steps, respectively.
Figure 3: Violin plot of the DER (%) distribution for different numbers of inference steps. For each number of steps, we sample 15 times with different random seeds to generate the diarization results.
