SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures

Kuang Yuan, Yifeng Wang, Xiyuxing Zhang, Chengyi Shen, Swarun Kumar, Justin Chan

TL;DR

SonicSieve tackles directional speech extraction on smartphones by pairing a passive, bio-inspired acoustic microstructure with an on-device neural network. The microstructure embeds direction-dependent spectral cues onto the input signal, and a two-microphone setup provides a before/after view of that signal, which enables robust performance in urban environments. A real-time TF-GridNet-based network learns to extract speech from user-selected sectors. Key contributions include the 20 mm, six-hole resin microstructure design, close microphone integration (~1 cm), and a 6-sector smartphone UI that enables real-time, multi-speaker transcription without large microphone arrays. Empirical results across five real rooms show SI-SDR improvements of up to $5.0$ dB for single-sector focus, favorable subjective listening scores, and latency under $8$ ms on common smartphones. Generalization is demonstrated via leave-one-room-out validation and cross-device tests. The work offers a practical, open-source pathway toward democratizing directional speech capture on commodity devices.

Abstract

Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line microphone of low-cost wired earphones, which can in turn be attached to smartphones. We present an end-to-end neural network that processes the raw audio mixtures in real time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, the performance of our system, which uses only two microphones, exceeds that of conventional 5-microphone arrays.
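
The 5.0 dB signal quality improvement is reported in the TL;DR as an SI-SDR improvement. As a rough illustration of what that metric measures, the sketch below computes scale-invariant SDR and its improvement over the unprocessed mixture using the standard textbook definition. It is not the authors' evaluation code: the function names `si_sdr` and `si_sdr_improvement`, the mean-removal step, and the small `eps` guard are assumptions made for this example.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Remove the mean so a DC offset does not affect the score
    # (a common convention, assumed here).
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference: the optimally scaled target.
    ref_energy = sum(x * x for x in r)
    alpha = sum(a * b for a, b in zip(e, r)) / (ref_energy + eps)
    target = [alpha * x for x in r]
    # Everything left over after the projection counts as distortion.
    residual = [a - t for a, t in zip(e, target)]
    target_energy = sum(x * x for x in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(mixture, estimate, reference):
    """SI-SDR of the processed output minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): a target tone plus an interfering tone.
n = 1000
target_speech = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
interferer = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
mixture = [s + i for s, i in zip(target_speech, interferer)]
# Pretend an extractor attenuated the interferer by a factor of 10.
estimate = [s + 0.1 * i for s, i in zip(target_speech, interferer)]

print(f"SI-SDR improvement: {si_sdr_improvement(mixture, estimate, target_speech):.2f} dB")
```

In this toy case the interferer is orthogonal to the target, so attenuating it tenfold in amplitude yields an improvement of about 20 dB. Because the estimate is projected onto the reference before the energies are compared, rescaling the output does not change the score.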

Paper Structure

This paper contains 13 sections, 10 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Effect of microstructure on incoming sound signals across different angles of arrival. The microstructure introduces larger variations in the frequency response $M_\theta(f)$ across different angles, providing enhanced spatial cues that can be leveraged for directional speech extraction.
  • Figure 2: Acoustic microstructure: principle of operation. The structural elements of the microstructure (holes, tubes, and resonators) form a complex multipath environment that creates variations to incoming acoustic signals based on their direction of arrival.
  • Figure 3: Effect of microstructure material (without holes) on acoustic attenuation. The resin material effectively attenuates sound and ensures that it primarily travels through the microstructure’s holes, rather than its walls.
  • Figure 4: Effect of microstructure diameter on spatial diversity. The microstructure with a larger diameter (20 mm) provides an overall higher spatial diversity across the speech frequencies.
  • Figure 5: A comparison of spatial diversity across different microstructure designs of varied diameter and hole number. Spatial diversity is computed across angles in a semicircle.
  • ...and 10 more figures