SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures
Kuang Yuan, Yifeng Wang, Xiyuxing Zhang, Chengyi Shen, Swarun Kumar, Justin Chan
TL;DR
SonicSieve tackles directional speech extraction on smartphones by pairing a passive, bio-inspired acoustic microstructure with an on-device neural network. The microstructure embeds direction-dependent spectral cues onto the incoming signal, while a two-microphone setup provides a before/after view of that signal, which supports robust performance in urban environments. A real-time TF-GridNet-based network then learns to extract speech from user-selected sectors. Key contributions include a 20 mm, six-hole resin microstructure design, close microphone integration (~1 cm), and a 6-sector smartphone UI that enables real-time, multi-speaker transcription without large microphone arrays. Empirical results across five real rooms show SI-SDR improvements of up to $5.0$ dB for single-sector focus, favorable subjective listening scores, and latency under $8$ ms on common smartphones; generalization is demonstrated via leave-one-room-out validation and cross-device tests. The work offers a practical, open-source pathway toward democratizing directional speech capture on commodity devices.
Abstract
Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line microphone of low-cost wired earphones, which in turn plug into a smartphone. We present an end-to-end neural network that processes the raw audio mixtures in real time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, our system, which uses only two microphones, outperforms conventional 5-microphone arrays.
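The signal quality improvement quoted above is reported in the TL;DR as an SI-SDR improvement. As a rough illustration of how such a number can be computed, the sketch below implements the standard scale-invariant signal-to-distortion ratio and takes the difference between the extracted output and the unprocessed mixture. This is not the authors' evaluation code: the function names `si_sdr` and `si_sdr_improvement`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an estimate and a clean reference.

    Hypothetical helper for illustration; both inputs are 1-D arrays of
    equal length.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove DC offset so the metric depends only on the waveform shape.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # part of the estimate explained by the reference
    residual = estimate - target  # everything else: noise, interference, artifacts

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(mixture, estimate, reference):
    """SI-SDR gain (dB) of the extracted signal over the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Given a clean target recording `reference`, the raw microphone signal `mixture`, and the network output `estimate`, calling `si_sdr_improvement(mixture, estimate, reference)` returns the gain in dB. A positive value means the output is closer to the target speaker than the mixture was.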
