Robust Dual-Modal Speech Keyword Spotting for XR Headsets

Zhuojiang Cai; Yuhan Ma; Feng Lu

TL;DR

The paper tackles robust keyword spotting (KWS) for XR headsets by fusing vocal speech with ultrasonic echo cues in a dual-modal system. It implements two fusion strategies, reliability-based and MLP-based, on a HoloLens 2-based prototype that uses lightweight CNNs and an optimized echoic network operating on FMCW echoes. In experiments covering noisy environments, silent speech, and nearby-speaker interference, the dual-modal approach consistently outperforms single-modal vocal KWS and still performs well when the user speaks silently, which widens the range of practical XR interaction scenarios. The work also demonstrates real-time viability, reports ablation studies on model efficiency, and releases open-source code.

Abstract

While speech interaction finds widespread utility within the Extended Reality (XR) domain, conventional vocal speech keyword spotting systems continue to grapple with formidable challenges, including suboptimal performance in noisy environments, impracticality in situations requiring silence, and susceptibility to inadvertent activations when others speak nearby. These challenges, however, can potentially be surmounted through the cost-effective fusion of voice and lip movement information. Consequently, we propose a novel vocal-echoic dual-modal keyword spotting system designed for XR headsets. We devise two different modal fusion approaches and conduct experiments to test the system's performance across diverse scenarios. The results show that our dual-modal system not only consistently outperforms its single-modal counterparts, demonstrating higher precision in both typical and noisy environments, but also excels in accurately identifying silent utterances. Furthermore, we have successfully applied the system in real-time demonstrations, achieving promising results. The code is available at https://github.com/caizhuojiang/VE-KWS.
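
To make the idea of fusing the two modalities' outputs concrete, here is a minimal sketch of a confidence-weighted late fusion. It is not the authors' implementation: it assumes each pipeline emits a probability vector over the same keyword set (at least two classes), and it uses one minus the normalized entropy as a hypothetical reliability score. The function names `reliability` and `fuse_predictions` are illustrative only, and the paper's actual reliability definition is not given in this summary.

```python
import math


def reliability(probs):
    """Hypothetical reliability score: 1 - normalized entropy.

    Returns about 1.0 for a confident (peaked) prediction and 0.0 for a
    uniform one. Assumes len(probs) >= 2 and that probs sums to 1.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, 1.0 - entropy / math.log(len(probs)))


def fuse_predictions(p_vocal, p_echoic):
    """Weight each modality's prediction vector by its reliability score."""
    r_vocal = reliability(p_vocal)
    r_echoic = reliability(p_echoic)
    total = r_vocal + r_echoic
    # If both modalities are uninformative, fall back to equal weights.
    w_vocal = 0.5 if total < 1e-12 else r_vocal / total
    fused = [w_vocal * v + (1.0 - w_vocal) * e
             for v, e in zip(p_vocal, p_echoic)]
    norm = sum(fused)
    return [x / norm for x in fused]


# Example: the vocal channel is swamped by noise (uniform output),
# while the echoic channel is confident about keyword index 1.
vocal = [0.25, 0.25, 0.25, 0.25]
echoic = [0.05, 0.85, 0.05, 0.05]
fused = fuse_predictions(vocal, echoic)
predicted_keyword = fused.index(max(fused))
```

In this sketch an uninformative modality contributes little to the fused vector, which is one plausible way a dual-modal system could stay usable when the voice channel is noisy or absent. A learned alternative, such as the MLP-based fusion named in the paper, would replace the hand-written weighting with a small trained network.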

Paper Structure (31 sections, 8 equations, 6 figures, 1 table)

Introduction
Related Work
Vocal Speech Keyword Spotting
Silent Speech Interface on Wearables
Speech Interaction in XR
Dual-Modal Keyword Spotting System
System Overview
Hardware Implementation
Vocal Modal KWS
Echoic Modal KWS
Fusion Strategies
Reliability-based Fusion
MLP-based Fusion
Experiments
Data
...and 16 more sections

Figures (6)

Figure 1: Overview of vocal-echoic dual-modal KWS system for XR headset. (left) Hardware diagram of experimental equipment. The speakers and microphones are mounted on the XR headset, connected to an ESP32, which sends the audio to a PC over the network. The PC is used for algorithm implementation and experiments, and it sends the detected keywords back to the application on the headset. (right) Algorithm flowchart. The audio is separately filtered and input into the vocal and echoic modal KWS pipelines. The predicted vectors obtained from these pipelines are then fed into the fusion module to generate the keyword output.
Figure 2: Hardware Setup. (a-b) Front view and bottom view of our implementation with HoloLens headset. (c) ESP32 Add-on board.
Figure 3: (a) The frequency-time diagram of the FMCW signals in the two frequency bands. (b-c) Original and differential Echo Profile.
Figure 4: Frequency distribution of Meeting (left) and Metro (right). The frequency distribution of noise varies across different scenarios, with some noise having a significant presence in high frequencies, while another portion is primarily concentrated in lower frequencies.
Figure 5: Comparison of average Word Error Rates (WER) between single-modal and dual-modal KWS systems in all noise scenarios. Our dual-modal systems (RB Fusion and MLP Fusion) achieve lower WER than single-modal systems (Echoic and Vocal) across all SNRs. At the strongest noise level (SNR=-10.0), MLP fusion reduces WER by 15.68% and 16.57% compared to the vocal and echoic systems, respectively.
...and 1 more figures
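
The caption of Figure 3 refers to original and differential echo profiles. As a rough illustration only, the sketch below assumes that an echo profile is a sequence of per-frame vectors (for example, correlation magnitudes between the transmitted and received FMCW chirps) and that the differential profile is the difference between consecutive frames, which would suppress static reflections and keep movement-related changes. Both assumptions and the function name `differential_echo_profile` are hypothetical; this summary does not specify the authors' exact computation.

```python
def differential_echo_profile(frames):
    """Subtract each frame from the next one, element by element.

    `frames` is a list of equal-length lists of floats, one per FMCW frame.
    Returns len(frames) - 1 difference frames (empty if fewer than 2 frames).
    """
    return [
        [cur - prev for prev, cur in zip(f_prev, f_cur)]
        for f_prev, f_cur in zip(frames, frames[1:])
    ]


# Example: only the middle range bin changes between frames 0 and 1,
# and only the last range bin changes between frames 1 and 2.
echo_frames = [
    [1.0, 2.0, 3.0],
    [1.0, 2.5, 3.0],
    [1.0, 2.5, 2.0],
]
diff_frames = differential_echo_profile(echo_frames)
```

Range bins that stay constant across frames become zero in the differential profile, so a downstream model sees mostly the changes over time.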
