TSELM: Target Speaker Extraction using Discrete Tokens and Language Models

Beilong Tang; Bang Zeng; Ming Li

TL;DR

Problem: target speaker extraction (TSE) aims to isolate one speaker from a mixture using auxiliary cues, but discriminative methods face generalization gaps. Approach: TSELM discretizes audio into tokens from multiple WavLM layers ($n=6$, shape $n\times T$, $K=1000$ per layer), uses cross-attention to inject target-speaker cues, models the token sequences with an encoder-only language model, and reconstructs audio with a scalable HiFi-GAN; training uses a cross-entropy loss over the discrete tokens. Key findings: on Libri2Mix and WSJ0-2mix, TSELM-L achieves higher DNSMOS than a discriminative baseline with competitive intelligibility, while continuous baselines can achieve better dWER; multi-layer discretization and concatenation strategies help, and the choice of SSL model significantly affects the trade-offs. Significance: discrete-token representations combined with language-model sequence modeling can perform target speaker extraction with high speech quality and a scalable decoding pipeline, suggesting a viable path toward unified, multimodal audio generation workflows.

Abstract

We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
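The classification framing described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes, for illustration only, that discretization is a nearest-centroid lookup against one codebook per layer and that the model emits logits of shape `(n, T, K)`; the function names `discretize` and `token_cross_entropy`, the feature dimension `D`, and the random codebooks are all hypothetical.

```python
import numpy as np

def discretize(features, codebooks):
    """Map continuous frames to nearest-centroid token ids (assumed scheme).

    features:  (n, T, D) array, one D-dim frame sequence per layer.
    codebooks: (n, K, D) array, K centroids per layer.
    Returns an (n, T) integer array of token ids in [0, K).
    """
    # Squared Euclidean distance from every frame to every centroid: (n, T, K).
    diff = features[:, :, None, :] - codebooks[:, None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist, axis=-1)

def token_cross_entropy(logits, targets):
    """Mean cross-entropy of (n, T, K) logits against (n, T) target token ids."""
    # Numerically stable log-softmax over the K classes.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each target token.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())

rng = np.random.default_rng(0)
n, T, D, K = 6, 50, 16, 1000  # n and K follow the TL;DR; T and D are toy values
codebooks = rng.normal(size=(n, K, D))   # stand-in for learned centroids
features = rng.normal(size=(n, T, D))    # stand-in for WavLM layer outputs

tokens = discretize(features, codebooks)              # shape (n, T)
uniform_loss = token_cross_entropy(np.zeros((n, T, K)), tokens)
```

With all-zero logits the model is uniform over the `K` classes, so the loss equals $\log K$; a trained model should fall below that value. The point of the sketch is only that the target becomes an integer id per layer and frame, so training reduces to a `K`-way classification rather than a regression on waveform or spectrogram values.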

Paper Structure (13 sections, 2 figures, 2 tables)

This paper contains 13 sections, 2 figures, 2 tables.

Introduction
Method
Encoding
Modeling
Attention Embedding
Cross Attention
Language Modeling
Experiments Setup
Training
Evaluation
Baseline models
Results and Discussions
Conclusion

Figures (2)

Figure 1: Overview of our proposed target speaker extraction framework with discrete tokens and language models.
Figure 2: Details of the Cross-Attention mechanism in modeling.
