TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
Beilong Tang, Bang Zeng, Ming Li
TL;DR
Problem addressed: target speaker extraction (TSE) aims to isolate one speaker from a mixture using auxiliary cues, but discriminative methods show generalization gaps. Approach: TSELM discretizes audio into tokens from multiple WavLM layers ($n=6$ layers, token shape $n\times T$, $K=1000$ per layer), injects target-speaker cues through cross-attention, models the token sequences with an encoder-only language model, and reconstructs audio with a scalable HiFi-GAN; training minimizes cross-entropy over the discrete tokens. Key findings: on Libri2Mix and WSJ0-2mix, TSELM-L achieves higher DNSMOS than a discriminative baseline with competitive intelligibility, although continuous baselines can achieve better dWER; multi-layer discretization and concatenation strategies help, and the choice of SSL model significantly affects the trade-offs. Significance: discrete-token representations combined with language-model sequence modeling can perform target speaker extraction with high speech quality and a scalable decoding pipeline, suggesting a viable path toward unified, multimodal audio generation workflows.
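To make the $n\times T$ token shape concrete, here is a minimal sketch of per-layer discretization. It assumes a nearest-centroid (k-means-style) quantizer with one codebook per layer and reads $K$ as the per-layer codebook size; the summary above does not specify the quantizer, so both are assumptions. The names `nearest_centroid` and `discretize` are hypothetical, and the random features stand in for real WavLM layer outputs.

```python
import random


def nearest_centroid(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def discretize(layer_features, codebooks):
    """Map n layers of T frames each to an n x T grid of token ids.

    Assumption: each layer has its own codebook, so token ids from
    different layers are not interchangeable.
    """
    return [
        [nearest_centroid(frame, codebook) for frame in frames]
        for frames, codebook in zip(layer_features, codebooks)
    ]


# Toy sizes for illustration only; the summary states n = 6 and K = 1000.
rng = random.Random(0)
n, T, D, K = 6, 5, 4, 10
codebooks = [
    [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)] for _ in range(n)
]
layer_features = [
    [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)] for _ in range(n)
]

tokens = discretize(layer_features, codebooks)  # n rows, T token ids per row
```

Each row of `tokens` holds the token ids for one layer, so the grid has one column per frame. How the $n$ rows are then combined before the language model (for example by concatenating embeddings) is not shown here.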
Abstract
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
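The classification view described above can be sketched as a cross-entropy loss over predicted token logits. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and averaging the loss uniformly over all layers and frames is an assumption.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def token_loss(logit_grid, target_grid):
    """Mean cross-entropy over an n x T grid of target token ids.

    logit_grid[i][t] is a list of K scores for layer i at frame t;
    target_grid[i][t] is the clean-speech token id at that position.
    Assumption: every layer and frame is weighted equally.
    """
    losses = [
        cross_entropy(logits, target)
        for logit_row, target_row in zip(logit_grid, target_grid)
        for logits, target in zip(logit_row, target_row)
    ]
    return sum(losses) / len(losses)


# An uninformed model (uniform logits over K = 1000 classes) pays ln(1000)
# per token, which is the baseline a trained model has to beat.
K = 1000
logit_grid = [[[0.0] * K for _ in range(3)] for _ in range(2)]
target_grid = [[7, 42, 999], [0, 1, 2]]
baseline = token_loss(logit_grid, target_grid)  # about 6.908 (= ln 1000)
```

Because the targets are discrete ids, the model only has to rank $K$ classes per position instead of regressing waveform or spectrogram values, which is the regression-to-classification conversion the abstract refers to.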
