FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses

Zhongweiyang Xu, Ali Aroudi, Ke Tan, Ashutosh Pandey, Jung-Suk Lee, Buye Xu, Francesco Nesta

TL;DR

This work tackles real-time, configurable FoV speech enhancement for smart glasses without relying on target DoA information. It introduces FoVNet, a lightweight neural core (about $50$ MMACS) that fuses per-block maxDI spatial features, ERB spectral features, and learnable FoV embeddings to estimate a $64$-band ERB gain, complemented by a low-distortion multi-channel Wiener filter (MCWF) and post-processing. The method enhances all speech within a configurable FoV and improves SI-SDR, PESQ, and STOI over strong baselines, while keeping end-to-end latency around 16 ms. This approach paves the way for practical, energy-efficient augmented hearing on wearable glasses, with flexible FoV control suited to daily conversations.

Abstract

This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with high computational efficiency. The neural network component is designed with ultra-low computation (about 50 MMACS). A multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. We evaluate our algorithm with a microphone array on smart glasses, providing a configurable, efficient solution for augmented hearing on energy-constrained devices. FoVNet excels in both computational efficiency and speech quality across multiple scenarios, making it a promising solution for smart glasses applications.

Paper Structure (11 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: A user wearing smart glasses with a microphone array. The horizontal plane is divided into $K=20$ blocks. The FoV (grey blocks) here is $-45^\circ$ to $27^\circ$, containing the target conversation.
  • Figure 2: Configurable FoV Enhancement Pipeline.
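
The block layout in the Figure 1 caption can be illustrated with a small sketch. With $K=20$ blocks, each block spans $360^\circ / 20 = 18^\circ$, so the example FoV of $-45^\circ$ to $27^\circ$ ($72^\circ$ wide) covers four blocks. The code below is a minimal illustration, not the authors' implementation: the function name `fov_block_mask`, the choice to center blocks on multiples of $18^\circ$ (so that block edges fall on $-45^\circ$ and $27^\circ$), and the binary-mask representation are all assumptions made here.

```python
def fov_block_mask(fov_start_deg, fov_end_deg, num_blocks=20):
    """Return (mask, centers): mask[k] is 1 if block k lies inside the FoV.

    Assumption (not stated in the source): block k is centered at
    k * (360 / num_blocks) degrees, wrapped into [-180, 180).
    """
    width = 360.0 / num_blocks  # 18 degrees per block when num_blocks = 20
    centers = [((k * width + 180.0) % 360.0) - 180.0 for k in range(num_blocks)]

    # Width of the FoV, measured from start to end in the direction of
    # increasing angle; the modulo handles FoVs that wrap past +/-180 degrees.
    # Note: a full 360-degree FoV gives span 0 here and is not handled.
    span = (fov_end_deg - fov_start_deg) % 360.0

    # A block is selected when its center lies within the FoV interval.
    mask = [1 if (c - fov_start_deg) % 360.0 < span else 0 for c in centers]
    return mask, centers


if __name__ == "__main__":
    mask, centers = fov_block_mask(-45.0, 27.0)
    selected = sorted(c for c, m in zip(centers, mask) if m)
    print(selected)
```

Under these assumptions, the selected blocks for the Figure 1 FoV are the four centered at $-36^\circ$, $-18^\circ$, $0^\circ$, and $18^\circ$. A mask of this kind is one plausible way to represent a configurable FoV as a per-block input; how FoVNet actually encodes the FoV (the learnable FoV embeddings mentioned in the TL;DR) is not specified in this section.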