CoHear: Conversation Enhancement via Multi-Earphone Collaboration

Lixing He, Yunqi Guo, Zhenyu Yan, Guoliang Xing

TL;DR

CoHear tackles cocktail party deafness by enabling conversation-level speech enhancement through a mobile, infrastructure-free network of earphones. It introduces a conversation-driven network and a robust target-conversation extraction model that uses both non-verbal cues (head orientation) and partial verbal signals, with two-stage node discovery and a dynamic geometric calibration pipeline within a mobile wireless acoustic sensor network (Mobile WASN). Real-world and simulated evaluations show over 90% conversation-group formation accuracy, up to 8.8 dB Si-SNR improvement, and real-time performance on mobile hardware, backed by a user study with 20 participants. The work demonstrates scalable, bandwidth-aware, multi-device collaboration for enhancing conversations in noisy social environments and outlines paths toward hardware integration and open-source release.

Abstract

In crowded places such as conferences, background noise, overlapping voices, and lively interactions make it difficult to have clear conversations. This situation often worsens the phenomenon known as "cocktail party deafness." We present ClearSphere, a collaborative system that enhances speech at the conversation level using multiple earphones. Real-time conversation enhancement requires holistic modeling of all the members in the conversation and an effective way to extract their speech from the mixture. ClearSphere bridges the acoustic sensor system and state-of-the-art deep learning for target speech extraction through two key contributions: 1) a conversation-driven network protocol, and 2) a robust target conversation extraction model. Our networking protocol enables mobile, infrastructure-free coordination among earphone devices. Our conversation extraction model can leverage the relayed audio in a bandwidth-efficient way. ClearSphere is evaluated in both real-world experiments and simulations. Results show that our conversation network achieves more than 90% accuracy in group formation, improves speech quality by up to 8.8 dB over state-of-the-art baselines, and runs in real time on a mobile device. In a user study with 20 participants, ClearSphere scores much higher than the baseline and shows good usability.
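The TL;DR reports the speech-quality gain as an Si-SNR improvement. As a rough illustration of what that number measures, the sketch below computes the commonly used scale-invariant SNR definition and the improvement of an enhanced signal over the unprocessed mixture. It is a minimal sketch in plain Python, not the paper's evaluation code; the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for this example.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Hypothetical helper for illustration; follows the standard definition
    (zero-mean signals, projection of the estimate onto the reference).
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(enhanced, mixture, reference):
    """Gain in dB of the enhanced signal over the unprocessed mixture."""
    return si_snr(enhanced, reference) - si_snr(mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced signal leaves the score essentially unchanged, which is why this metric is often preferred over plain SNR when a model's output gain is arbitrary.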

Paper Structure

This paper contains 65 sections, 6 equations, 23 figures, 4 tables.

Figures (23)

  • Figure 1: A conversation group forms naturally through human interaction. CoHear is a collaborative conversation enhancement solution that is driven by the user's intention to talk, automatically finds a pal, and delivers clear speech within the same conversation.
  • Figure 2: Overview of the CoHear system. We conduct node discovery for each user and carry out geometric calibration collaboratively. Finally, we perform target conversation extraction based on clues related to the members of the conversation.
  • Figure 3: Node discovery for conversation setup: we detect the sound source in the first stage and find the corresponding speaker embedding in the second stage.
  • Figure 3: User study – listening test.
  • Figure 4: Network geometric calibration uses the observed distance and direction of arrival, along with user motion, to estimate the locations of all users in the network via iterative optimization.
  • ...and 18 more figures