Aurchestra: Fine-Grained, Real-Time Soundscape Control on Resource-Constrained Hearables

Seunghyun Oh, Malek Itani, Aseem Gauri, Shyamnath Gollakota

TL;DR

This work introduces Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables, and shows that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.

Abstract

Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for up to 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.
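
To make the per-class mixing idea concrete, the following is a minimal sketch of how separated class streams might be recombined with user-chosen volumes, one 6 ms chunk at a time. It is not the authors' implementation: the 16 kHz sample rate, the function name `mix_chunk`, the class labels, the linear gain representation, and the hard clipping step are all assumptions made for illustration. The abstract only states that the network emits one stream per selected class and that audio is processed in 6 ms chunks.

```python
from typing import Dict, List

SAMPLE_RATE_HZ = 16_000  # assumed sample rate; not stated in the abstract
CHUNK_MS = 6             # chunk duration taken from the abstract
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # samples per chunk


def mix_chunk(streams: Dict[str, List[float]],
              gains: Dict[str, float],
              default_gain: float = 0.0) -> List[float]:
    """Mix one chunk of per-class streams using linear per-class gains.

    `streams` maps a (hypothetical) class label to that class's extracted
    samples for the current chunk. Classes without an entry in `gains`
    use `default_gain`, which is 0.0 (muted) here.
    """
    if not streams:
        return [0.0] * CHUNK_SAMPLES
    lengths = {len(samples) for samples in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all class streams must have the same chunk length")
    out = [0.0] * lengths.pop()
    for label, samples in streams.items():
        gain = gains.get(label, default_gain)
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    # Hard-clip to [-1, 1] so boosted classes cannot overflow the output range.
    return [max(-1.0, min(1.0, y)) for y in out]


# Usage: keep speech at full volume, halve the siren, mute everything else.
chunk = {
    "speech": [0.5] * CHUNK_SAMPLES,
    "siren": [0.25] * CHUNK_SAMPLES,
    "traffic": [0.1] * CHUNK_SAMPLES,
}
mixed = mix_chunk(chunk, {"speech": 1.0, "siren": 0.5})
```

Under these assumptions each chunk holds 96 samples, and the mixer's cost grows linearly with the number of active classes, so it adds little on top of the extraction network itself.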

Paper Structure (32 sections, 10 figures, 4 tables)

This paper contains 32 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Multi-output sound extraction architecture.
  • Figure 2: (a) CDF plots of the runtime of the three proposed networks on their respective deployment platforms. (b) Runtime and power consumption of running the NeuralAids model at different GAP9 clock frequencies.
  • Figure 3: Sound Event Detection performance comparison on 5-second audio segments. We compare YAMNet, AST, and our fine-tuned AST model across different numbers of simultaneous sound sources. Our fine-tuned model maintains high performance even with 5 concurrent sources.
  • Figure 4: Runtime CDF of the fine-tuned AST model across different iPhone platforms. All platforms complete inference faster than the 5-second audio segment duration.
  • Figure 5: In-the-wild scenarios. The wearer and sound sources were free to move, and head rotation was uncontrolled.
  • ...and 5 more figures