NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction

Jun Rekimoto, Yu Nishimura, Bojian Yang

TL;DR

NasoVoce is a nose-bridge-mounted interface that fuses a microphone and a vibration sensor to generate high-quality speech that is robust against interference. It demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.

Abstract

Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
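
The microphone's sensitivity to environmental noise, noted above, can be explored with a small simulation. The following is a minimal sketch, not the authors' code: it mixes a noise signal into clean speech at a chosen signal-to-noise ratio (SNR) to form a synthetic noisy microphone input. The helper name `mix_at_snr`, the synthetic tone and Gaussian noise, the 16 kHz sample rate, and the 5 dB value are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return clean + scaled noise so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that clean_power / (scale^2 * noise_power) == 10^(snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise


# Synthetic stand-ins: a 200 Hz tone as "speech" and Gaussian noise as "environment".
rng = np.random.default_rng(0)
sample_rate = 16_000  # assumed sample rate
t = np.arange(sample_rate) / sample_rate
clean = 0.1 * np.sin(2 * np.pi * 200.0 * t)
noise = rng.standard_normal(sample_rate)

noisy_mic = mix_at_snr(clean, noise, snr_db=5.0)

# Measure the SNR actually achieved by the mixture.
residual = noisy_mic - clean
achieved_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Sweeping `snr_db` downward gives progressively noisier inputs, which is one way to probe how far a microphone-only signal degrades before a second, noise-robust channel becomes necessary.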

Paper Structure (12 sections, 5 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: NasoVoce sensor configuration: signals from a MEMS microphone and a MEMS vibration sensor are acquired as time-synchronized digital data on the left and right channels of a TDM audio interface. An example usage is shown with the sensor mounted on the nose pad of a smart-glasses frame.
  • Figure 2: By covering the mouth and nose with one’s hand, the spoken content cannot be inferred through lip reading. This posture further serves as a social signal, indicating that the person is engaged in a conversation with the device.
  • Figure 3: D-DCCRN (Dual-DCCRN) accepts composite inputs from a microphone (Mic) and a vibration sensor (Vib). It generalizes the design of the DCCRN audio enhancement model to jointly process the real and imaginary components of both the Mic and Vib signals.
  • Figure 4: The NasoVoce training method combines an audio enhancement loss ($L_{ae}$) and a knowledge distillation loss ($L_{kd}$).
  • Figure 5: Audio quality improvement examples (normal and whispered speech) obtained by combining microphone (Mic) and vibration sensor (Vib) inputs. While the Mic input contains external noise, the enhancement result obtained from both Mic and Vib inputs (Enhancement) closely approximates the Ground Truth. The examples come from a simulation in which noise is added to the Ground Truth to form the Mic input and an audio enhancement model is then applied.
  • ...and 5 more figures
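
The captions for Figures 1 and 3 describe a two-channel capture whose real and imaginary spectral components are processed jointly. The following is a minimal sketch of that data path under stated assumptions, not the authors' implementation: it de-interleaves a two-channel sample buffer into Mic and Vib signals, computes a short-time Fourier transform of each, and stacks the real and imaginary parts into one array that a network could consume. The left-equals-Mic and right-equals-Vib ordering, the function names, the FFT size and hop length, and the `(4, freq, time)` layout are all hypothetical choices.

```python
from typing import Tuple

import numpy as np


def split_channels(interleaved: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """De-interleave [L0, R0, L1, R1, ...] into (mic, vib).

    Assumes left = microphone and right = vibration sensor.
    """
    frames = interleaved.reshape(-1, 2)
    return frames[:, 0], frames[:, 1]


def stft(signal: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Hann-windowed STFT without padding; returns a (freq, time) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def dual_input(mic: np.ndarray, vib: np.ndarray,
               n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Stack real and imaginary STFT parts of both sensors into (4, freq, time)."""
    mic_spec = stft(mic, n_fft, hop)
    vib_spec = stft(vib, n_fft, hop)
    return np.stack([mic_spec.real, mic_spec.imag, vib_spec.real, vib_spec.imag])


# One second of placeholder two-channel data at an assumed 16 kHz sample rate.
sample_rate = 16_000
interleaved = np.arange(2 * sample_rate, dtype=np.float64)
mic, vib = split_channels(interleaved)
features = dual_input(mic, vib)
```

Because both channels share one sample clock, frame `i` of the Mic spectrogram and frame `i` of the Vib spectrogram cover the same time span, which is what makes stacking them along a single axis meaningful.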