Proactive Hearing Assistants that Isolate Egocentric Conversations

Guilin Hu, Malek Itani, Tuochao Chen, Shyamnath Gollakota

TL;DR

The paper addresses the cocktail party problem for hearing aids by proposing a real-time, on-device proactive assistant that identifies the wearer's conversational partners without explicit prompts. It introduces a dual-model architecture that uses the wearer's self-speech as an anchor and pairs a fast streaming model with a slower embedding model that captures long-range dialogue dynamics, so that the partners' speech can be extracted and competing voices suppressed. Training combines synthetic, spatialized data with real-world fine-tuning, and the evaluation reports generalization across languages and speaker counts, with improvements in SISDRi and PESQ and high partner-identification accuracy. The work demonstrates the feasibility of on-device hearing augmentation and lays groundwork for future integration with dialogue-aware AI systems that track and maintain engagement in noisy, multi-party environments.

Abstract

We introduce proactive hearing assistants that automatically identify and separate the wearer's conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer's self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: https://proactivehearing.cs.washington.edu/
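
The dual-rate scheduling described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `run_dual_rate`, the callables `slow_model` and `fast_model`, and the chunk representation are all hypothetical. The 1 s slow period is taken from the Figure 2 caption, which gives 1000 / 12.5 = 80 fast steps per slow step.

```python
from typing import Any, Callable, List, Optional, Sequence

FAST_HOP_MS = 12.5       # fast-model period, from the abstract
SLOW_PERIOD_MS = 1000.0  # slow-model period, from the Figure 2 caption
HOPS_PER_SLOW = round(SLOW_PERIOD_MS / FAST_HOP_MS)  # 80 fast hops per slow step


def run_dual_rate(
    chunks: Sequence[Any],
    slow_model: Callable[[List[Any]], Any],
    fast_model: Callable[[Any, Optional[Any]], Any],
    hops_per_slow: int = HOPS_PER_SLOW,
) -> List[Any]:
    """Run a fast model on every chunk and a slow model once per window.

    Each chunk stands for one 12.5 ms hop of input (for example a pair of
    mixture and self-speech frames; the exact content is an assumption).
    The fast model only ever sees the embedding computed from the previous
    completed window, so it never waits on the slow model.
    """
    embedding: Optional[Any] = None  # nothing available during the first window
    window: List[Any] = []
    outputs: List[Any] = []
    for chunk in chunks:
        # Low-latency path: runs on every hop with the latest finished embedding.
        outputs.append(fast_model(chunk, embedding))
        window.append(chunk)
        if len(window) == hops_per_slow:
            # Slow path: summarize the window that just ended; the result is
            # used by the fast model during the next window.
            embedding = slow_model(window)
            window = []
    return outputs
```

In this sketch the slow model is called inline for simplicity. On a real device it would more plausibly run concurrently with the fast path, which is why the fast model is only given the embedding from the window that has already finished.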

Paper Structure

This paper contains 35 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: In multi-conversation settings, our proactive hearing assistant uses conversational turn-taking dynamics to automatically infer the wearer's conversation partners and suppress others in real time.
  • Figure 2: Overview of our model pipeline. A. The streaming beamformer extracts the wearer's self-speech from the binaural mixture. B. Dual-model architecture: the slow model runs every 1s ($T$) on the mixture and self-speech to produce a conversation embedding; the fast model runs every 12.5 ms ($\tau$) on the current mixture and embedding from the previous 1s ($T$), to output the cleaned target conversation.
  • Figure 3: The model enhances and then suppresses a speaker following that speaker's shift from the target conversation to an interfering conversation.
  • Figure 4: SISDRi histogram on egocentric recordings.
  • Figure 5: Extended periods of wearer silence. The gray regions denote durations when the wearer was active.
  • ...and 2 more figures
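
The SISDRi metric named in the Figure 4 caption is the improvement in scale-invariant signal-to-distortion ratio of the model output over the unprocessed mixture. The sketch below follows the standard SI-SDR definition. It is not the paper's evaluation code, the function names are hypothetical, and it skips the mean removal that some implementations apply first.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], target: Sequence[float], eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB of `estimate` with respect to `target`."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Project the estimate onto the target so that rescaling the estimate
    # does not change the score.
    scale = dot / (target_energy + eps)
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    residual_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))


def si_sdr_improvement(
    estimate: Sequence[float],
    mixture: Sequence[float],
    target: Sequence[float],
) -> float:
    """SI-SDR of the output minus SI-SDR of the input mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Toy example: the interferer is orthogonal to the target signal.
target = [1.0, 0.0, -1.0, 0.0]
interferer = [0.0, 1.0, 0.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]         # equal power: about 0 dB
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]  # interferer attenuated 10x
improvement_db = si_sdr_improvement(estimate, mixture, target)  # about 20 dB
```

A positive value means the output is closer to the target conversation than the raw mixture was. A histogram such as the one in Figure 4 would collect this value over many recordings.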