Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation
Peidong Wang, Naoyuki Kanda, Jian Xue, Jinyu Li, Xiaofei Wang, Aswin Shanmugam Subramanian, Junkun Chen, Sunit Sivasankaran, Xiong Xiao, Yong Zhao
TL;DR
This work tackles streaming multi-talker speech translation (ST) by adding real-time speaker change detection and gender classification. It integrates a token-level t-vector speaker-embedding module into a Transformer-transducer-based streaming multilingual ST system: speaker changes are detected from the cosine similarity between adjacent t-vectors, and gender is classified by comparing t-vectors with gender profiles. The approach preserves ST performance while providing low-latency, speaker-aware outputs. It reaches a token-level gender accuracy of 0.989 and a speaker-change F1 above roughly 0.66 on streaming 1 s chunks, which is competitive with offline baselines. These outputs support audio prompts for zero-shot TTS and speaker-profile-based TTS in multilingual settings, and training is data-efficient because it relies on SID-centered objectives. The results are relevant to real-time, speaker-aware translation pipelines and downstream TTS systems.
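The two cosine-similarity decisions described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-dimensional toy vectors, the gender profile vectors, and the threshold of 0.5 are all assumptions chosen for illustration, and real t-vectors would come from the speaker-embedding module.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_speaker_changes(
    t_vectors: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return token indices whose t-vector differs from the previous token's.

    A change is flagged at index i when the similarity between tokens i-1
    and i falls below `threshold` (a hypothetical value, not from the paper).
    """
    return [
        i
        for i in range(1, len(t_vectors))
        if cosine_similarity(t_vectors[i - 1], t_vectors[i]) < threshold
    ]


def classify_gender(
    t_vector: Sequence[float], gender_profiles: Dict[str, Sequence[float]]
) -> str:
    """Pick the label whose profile vector is most similar to the t-vector."""
    return max(
        gender_profiles,
        key=lambda label: cosine_similarity(t_vector, gender_profiles[label]),
    )


# Toy data (assumed): two tokens from one speaker, then two from another.
t_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
# Hypothetical gender profiles, e.g. averaged embeddings per gender.
gender_profiles = {"female": [1.0, 0.0], "male": [0.0, 1.0]}

changes = detect_speaker_changes(t_vectors)
genders = [classify_gender(v, gender_profiles) for v in t_vectors]
```

In this toy case only the boundary between the second and third token falls below the threshold, so a single change is flagged there. In a streaming setting the same comparison would be applied as each new token's t-vector arrives, and the threshold would need tuning on held-out data.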
Abstract
Streaming multi-talker speech translation involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods achieve high accuracy for both speaker change detection and gender classification.
