Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

Peidong Wang, Naoyuki Kanda, Jian Xue, Jinyu Li, Xiaofei Wang, Aswin Shanmugam Subramanian, Junkun Chen, Sunit Sivasankaran, Xiong Xiao, Yong Zhao

TL;DR

This work tackles streaming multi-talker speech translation (ST) by adding real-time speaker change detection and gender classification to the translation model. It integrates a token-level t-vector speaker-embedding module into a Transformer-transducer-based streaming multilingual ST system. Speaker changes are detected from the cosine similarity between adjacent t-vectors, and gender is classified by comparing t-vectors with gender profiles. The approach preserves ST performance while providing low-latency, speaker-aware outputs: token-level gender accuracy reaches $0.989$, and speaker change detection attains F1 above roughly 0.66 with 1 s streaming chunks, which is competitive with offline baselines. The outputs support audio prompts for zero-shot TTS and speaker-profile selection for conventional TTS in multilingual settings, and training is data-efficient because it relies on SID-centered objectives. These results are relevant to real-time, speaker-aware translation pipelines and downstream TTS systems.
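The two decisions described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D toy vectors, the `threshold` value of 0.5, the profile labels, and the function names are all assumptions made for illustration, and the summary does not state how the gender profiles are built or which threshold is used.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_speaker_changes(t_vectors: List[Vector], threshold: float = 0.5) -> List[int]:
    """Return token indices i where token i is judged to start a new speaker.

    A change is flagged when the similarity between the t-vectors of
    tokens i-1 and i falls below `threshold` (an assumed value).
    """
    return [
        i
        for i in range(1, len(t_vectors))
        if cosine_similarity(t_vectors[i - 1], t_vectors[i]) < threshold
    ]


def classify_gender(t_vector: Vector, gender_profiles: Dict[str, Vector]) -> str:
    """Return the label of the profile most similar to `t_vector`."""
    return max(
        gender_profiles,
        key=lambda label: cosine_similarity(t_vector, gender_profiles[label]),
    )


# Hypothetical token-level t-vectors: two tokens from one speaker, then two from another.
toy_t_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
# Hypothetical gender profiles, e.g. averaged embeddings per gender.
toy_profiles = {"female": [1.0, 0.0], "male": [0.0, 1.0]}

changes = detect_speaker_changes(toy_t_vectors)
genders = [classify_gender(v, toy_profiles) for v in toy_t_vectors]
```

Because both decisions only need the current and previous token embeddings plus fixed profiles, a scheme of this shape can run token by token as the translation is emitted, which is what makes it suitable for streaming.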

Abstract

Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.

Paper Structure

This paper contains 19 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of Transformer transducer for ST.
  • Figure 2: Illustration of the reception field of a streaming T-T at position $f_{10}$ with chunk size 3 and the number of left chunks 1.
  • Figure 3: Illustration of LAMASSU-UNI.
  • Figure 4: Illustration of t-vector model for ST.
  • Figure 5: Illustration of the speaker encoder layers. The speaker ID extractor is typically a d-vector extractor.
  • ...and 2 more figures