SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization

Artem Dementyev, Dimitri Kanevsky, Samuel J. Yang, Mathieu Parvaix, Chiong Lai, Alex Olwal

TL;DR

SpeechCompass tackles the challenge of distinguishing who is speaking in mobile captioning by adding real-time, 360° speaker localization and diarization via a four-microphone embedded system. It combines localization based on generalized cross-correlation with phase transform (GCC-PHAT), kernel density estimation (KDE) fusion, and a low-power microcontroller (MCU) to deliver low-latency, direction-aware transcripts and multiple visualization options in a mobile app. The work provides hardware, algorithms, and UI designs, backed by large-scale surveys (n=263 and n=494) and an in-person study with eight frequent users of mobile speech-to-text, which highlight the practicality and desirability of diarization and directional guidance in group conversations. The results show improved diarization accuracy, acceptable latency, and positive user reception, suggesting that directional captions on mobile devices can meaningfully enhance accessibility and comprehension in real-world social settings.

Abstract

Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging to use in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech allows visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware, running on a low-power microcontroller with four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. The value of diarization and visualizing localization was consistent across participants, with everyone agreeing on the value and potential of directional guidance for group conversations.

Paper Structure

This paper contains 33 sections, 4 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Overview of the SpeechCompass phone case prototype. A) A mobile phone application interface with a mounted multi-microphone phone case. B) Inside and outside view of the prototype with a flexible PCB microphone mount and a compact main printed circuit board (PCB). C) Pictures of the main PCB with a top and bottom view.
  • Figure 2: Participant responses to the question "What are the biggest challenges with your current captioning or transcription device/technology? (select all that apply)"
  • Figure 3: Survey results of how often participants encountered challenging scenarios with today's transcription technology. The number and percentage of participants are shown for each choice.
  • Figure 4: SpeechCompass system diagram. The phone case contains four microphones connected to a microcontroller. The audio localization algorithms run on the microcontroller, and the angle estimation is sent over USB to the phone. The SpeechCompass app combines ASR input and angle estimations to provide diarization and directional guidance for the mobile captioning UI.
  • Figure 5: Visualization of localization methods with 2 and 3 microphone configurations. A) Localization with two microphones. The sound will arrive at microphone two before microphone one. This time difference could be used to estimate the angle of arrival. However, with two microphones, ambiguity exists, as the source could be at the inverse angle, shown as a "potential source." The graph on the bottom shows the kernel density estimation (KDE) with actual and potential sources. B) With three or more microphones, this angle ambiguity can be avoided. In our implementation, we use four microphones. The KDE from multiple microphone pairs will have the highest peak at the correct source.
  • ...and 9 more figures
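
The two-microphone ambiguity and its resolution described in the Figure 5 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the far-field model, the microphone spacing, the pair orientations, the KDE bandwidth, and all function names are hypothetical choices made for the example. It takes the time differences of arrival as given (in the paper's pipeline these would come from GCC-PHAT), converts each into two mirrored candidate bearings, and picks the peak of a circular Gaussian KDE over the candidates from all pairs.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def simulate_tdoa(source_deg, spacing_m, axis_deg):
    """Far-field time difference (arrival at mic A minus arrival at mic B)
    for a pair whose A-to-B axis points along axis_deg. Hypothetical helper."""
    return spacing_m * math.cos(math.radians(source_deg - axis_deg)) / SPEED_OF_SOUND


def pair_candidates(tdoa_s, spacing_m, axis_deg):
    """Return the two candidate bearings (degrees in [0, 360)) for one pair.
    A single pair cannot tell a source from its mirror image across the axis."""
    # cos(theta) = c * tdoa / d, clamped so measurement noise cannot break acos.
    cos_theta = max(-1.0, min(1.0, SPEED_OF_SOUND * tdoa_s / spacing_m))
    theta = math.degrees(math.acos(cos_theta))
    return [(axis_deg + theta) % 360.0, (axis_deg - theta) % 360.0]


def circular_kde(angles_deg, bandwidth_deg=10.0, step_deg=1.0):
    """Unnormalized Gaussian KDE evaluated on a circular grid of bearings."""
    grid = [i * step_deg for i in range(int(360 / step_deg))]
    density = []
    for g in grid:
        total = 0.0
        for a in angles_deg:
            diff = (g - a + 180.0) % 360.0 - 180.0  # wrapped angular difference
            total += math.exp(-0.5 * (diff / bandwidth_deg) ** 2)
        density.append(total)
    return grid, density


def estimate_bearing(pairs):
    """pairs: iterable of (tdoa_s, spacing_m, axis_deg). Returns the KDE peak."""
    candidates = []
    for tdoa_s, spacing_m, axis_deg in pairs:
        candidates.extend(pair_candidates(tdoa_s, spacing_m, axis_deg))
    grid, density = circular_kde(candidates)
    return grid[max(range(len(grid)), key=density.__getitem__)]


# Usage: a source at 50 degrees seen by two perpendicular pairs (assumed geometry).
SPACING_M = 0.1  # assumed microphone spacing
pairs = [(simulate_tdoa(50.0, SPACING_M, axis), SPACING_M, axis) for axis in (0.0, 90.0)]
print(pair_candidates(pairs[0][0], SPACING_M, 0.0))  # true bearing and its mirror
print(estimate_bearing(pairs))
```

With one pair, both candidates receive equal density, which is the ambiguity shown in panel A. Adding a second pair with a different axis places its mirror candidate elsewhere, so only the true bearing accumulates density from both pairs, as in panel B.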