Spatial Speech Translation: Translating Across Space With Binaural Hearables
Tuochao Chen, Qirui Wang, Runlin He, Shyam Gollakota
TL;DR
This work tackles the problem of translating speech in a wearer’s acoustic space while preserving each speaker's spatial cues and voice characteristics. It introduces Spatial Speech Translation, a three-stage on-device pipeline that combines joint localization and separation, simultaneous expressive translation, and binaural rendering, trained with a strategy that draws on both synthetic and real-world data. Key contributions include a real-time, on-device translation model that handles multiple speakers, a separation-aware finetuning regime that makes translation more robust to separation artifacts, and three binaural rendering methods evaluated for spatial fidelity. Real-world user studies and benchmarks show that the system preserves speaker directions, delivers competitive translation quality, and runs in real time, marking a first step toward integrating spatial perception into speech translation for hearables and AR contexts.
Abstract
Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering that preserves the speaker directions in the translated audio, all while achieving real-time inference on Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.
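To make the idea of "preserving the speaker directions in the translated audio" concrete, the following is a minimal sketch of one generic way to place a mono signal at a given azimuth, using an interaural time difference (Woodworth approximation) and a crude interaural level difference. It is not one of the three rendering methods evaluated in this work; the function names, the head radius, the 6 dB level cap, and the sign convention (positive azimuth means the source is to the wearer's right) are all assumptions made for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius (hypothetical constant)
SPEED_OF_SOUND = 343.0   # metres per second, at room temperature


def itd_seconds(azimuth_deg):
    """Woodworth approximation of the interaural time difference.

    Positive azimuth = source to the right. The angle is clamped to
    [-90, 90] degrees, the range where the approximation applies.
    """
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))


def render_binaural(mono, azimuth_deg, sample_rate=16000, max_ild_db=6.0):
    """Return (left, right) sample lists for a mono signal at azimuth_deg.

    The ear farther from the source is delayed by the ITD (rounded to whole
    samples) and attenuated by a simple, assumed level-difference model.
    """
    clamped = max(-90.0, min(90.0, azimuth_deg))
    delay = int(round(abs(itd_seconds(clamped)) * sample_rate))

    # Assumed ILD model: attenuation grows with |sin(azimuth)| up to max_ild_db.
    ild_db = max_ild_db * abs(math.sin(math.radians(clamped)))
    far_gain = 10.0 ** (-ild_db / 20.0)

    # Pad both channels to the same length.
    near = list(mono) + [0.0] * delay
    far = [0.0] * delay + [far_gain * s for s in mono]

    if clamped >= 0:          # source on the right: right ear is the near ear
        return far, near      # (left, right)
    return near, far


# Usage: place a short click 90 degrees to the right.
left, right = render_binaural([1.0, 0.0, 0.0, 0.0], azimuth_deg=90.0)
```

A frontal source (azimuth 0) yields identical left and right channels, while a source at the side reaches the far ear later and quieter. A practical renderer would use measured or learned binaural filters and fractional delays rather than this two-cue model.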
