Spatial Speech Translation: Translating Across Space With Binaural Hearables

Tuochao Chen, Qirui Wang, Runlin He, Shyam Gollakota

TL;DR

This work tackles the problem of translating speech in a wearer’s acoustic space while preserving spatial cues and speaker voice characteristics. It introduces Spatial Speech Translation, a three-part on-device pipeline combining joint localization/separation, simultaneous expressive translation, and binaural rendering, with a robust synthetic-data and real-world training strategy. Key contributions include a real-time, on-device translation model that handles multiple speakers, a separation-aware finetuning regime to boost translation robustness, and three binaural rendering methods evaluated for spatial fidelity. Real-world user studies and comprehensive benchmarks demonstrate strong spatial preservation, competitive translation quality, and practical latency, marking a significant step toward spatial awareness in speech translation for hearables and AR contexts.

Abstract

Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.

Paper Structure

This paper contains 31 sections, 6 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview of spatial speech translation. The input to our pipeline is a noisy binaural speech mixture in the source language (e.g., French). The pipeline consists of three main components: 1) A lightweight streaming model that separates and localizes each speaker in the binaural mixture, extracting spatial cues for each voice. 2) A streaming speech translation model that translates the separated speech chunks into the target language (e.g., English), while an expressive encoder and vocoder preserve the vocal qualities and expressiveness of the original audio. 3) Binaural rendering that reconstructs binaural playback using the extracted spatial cues.
  • Figure 2: Spatial cue extraction and binaural rendering. (A) shows search-based joint localization and separation. We divide the space into multiple small angular regions and apply streaming TF-GridNet to each region. If no source exists in a region (e.g., $\theta_i$), the model outputs zeros. If a source exists (e.g., SPK1 in $\theta_j$), the model outputs the separated binaural signal (SPK1). The spatial cues are extracted from the estimated angle and the ILD of the separated binaural output. (B) shows binaural rendering in the presence of translation latency. Output speech chunk 0 (green block 0) is generated with a 2-chunk delay relative to source speech chunk 0 (blue block 0). When we render the binaural output for chunk 0, we apply the spatial cues of the current incoming chunk (source chunk 2) rather than those of source chunk 0.
  • Figure 3: Simultaneous and expressive speech-to-speech translation. (1) In simultaneous speech-to-text (S2T) translation, a speech encoder extracts the hidden states $H$ of incoming speech chunks. Source and target CTC decoders cooperate with a policy algorithm to choose between the "WRITE" and "READ" actions for simultaneous translation. When "WRITE" is chosen, the text decoder outputs the translated target text token $Z$. (2) In streaming expressive text-to-speech (T2S) generation, a Text-to-Units (T2U) model converts the text token $Z$ to speech units. The T2U model is trained using target units extracted from the XLS-R-1B model. Meanwhile, the expressive encoder extracts the expressive embedding from the input speech chunk. Finally, the expressive vocoder takes both the predicted units and the expressive embedding to synthesize speech in the target language.
  • Figure 4: Real-world evaluation settings. A-J show ten different unseen multipath environments tested in our real-world generalization evaluation. A-F show indoor spaces, including office spaces, classrooms, common open spaces, conference rooms, and work spaces. G-J show outdoor spaces such as a school area, a fountain park, a public picnic lawn, and a parking lot.
  • Figure 5: Subjective evaluation of semantic consistency and speaker similarity. The left figure shows the mean opinion score for three systems: existing translation models without any spatial awareness, our work with binaural source separation and translation, and our work with expressive translation added. The right figure shows the corresponding results for speaker similarity between the original French speech and the generated English translation (rated on a 1-4 scale).
  • ...and 6 more figures
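
The delayed-cue rendering described in the Figure 2(B) caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the only spatial cue is the interaural level difference (ILD), that the translated speech arrives as mono chunks, and that rendering is plain gain panning. The function names `ild_db`, `apply_ild`, and `render_stream` are hypothetical, and the angle-based cues mentioned in the caption are left out.

```python
import numpy as np


def ild_db(left, right, eps=1e-8):
    """Interaural level difference of one chunk in dB (left relative to right)."""
    p_left = np.mean(left ** 2) + eps
    p_right = np.mean(right ** 2) + eps
    return 10.0 * np.log10(p_left / p_right)


def apply_ild(mono, ild):
    """Pan a mono chunk into (left, right) so that the pair has the given ILD.

    Assumption: the level difference is split evenly between the two ears.
    """
    gain = 10.0 ** (ild / 40.0)
    return mono * gain, mono / gain


def render_stream(source_chunks, translated_chunks, delay=2):
    """Render translated chunk k with the cues of source chunk k + delay.

    source_chunks: list of (left, right) separated source-language chunks.
    translated_chunks: list of mono target-language chunks.
    delay: translation latency in chunks (2 in the Figure 2(B) example).
    """
    rendered = []
    for k, mono in enumerate(translated_chunks):
        # Use the chunk arriving now, not the chunk that was translated.
        idx = min(k + delay, len(source_chunks) - 1)
        left, right = source_chunks[idx]
        rendered.append(apply_ild(mono, ild_db(left, right)))
    return rendered
```

Using the most recent cue means the translated voice follows the speaker's current direction instead of the direction the speaker had when the source words were spoken.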