WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech

Hirotaka Hiraki, Shusuke Kanazawa, Takahiro Miura, Manabu Yoshida, Masaaki Mochimaru, Jun Rekimoto

TL;DR

WhisperMask is a mask-type electret condenser microphone with a large conductive-dielectric diaphragm designed to emphasize the wearer’s voice while suppressing ambient noise. The study characterizes WhisperMask acoustically via swept-sine impulse responses and evaluates its performance using SNR, subjective audio quality (MUSHRA), and automatic speech recognition for both natural and whispered speech. Across all metrics, WhisperMask outperforms conventional wearable microphones (PinMic, AirPods Pro2, ThroatMic) and remains robust without denoising, achieving substantially higher whispered-speech recognition in high-noise conditions. The work highlights practical benefits for private, hands-free voice interactions in real-world noisy settings while noting considerations such as daily wearability, motion, and wind effects.

Abstract

Whispering is a common privacy-preserving technique in voice-based interactions, but its effectiveness is limited in noisy environments. In conventional hardware- and software-based noise reduction approaches, isolating whispered speech from ambient noise and other speech sounds remains a challenge. We thus propose WhisperMask, a mask-type microphone featuring a large diaphragm with low sensitivity, making the wearer's voice significantly louder than the background noise. We evaluated WhisperMask using three key metrics: signal-to-noise ratio, quality of recorded voices, and speech recognition rate. Across all metrics, WhisperMask consistently outperformed traditional noise-suppressing microphones and software-based solutions. Notably, WhisperMask showed a 30% higher recognition accuracy for whispered speech recorded in an environment with 80 dB background noise compared with the pin microphone and earbuds. Furthermore, while a denoiser decreased the whispered speech recognition rate of these two microphones by approximately 20% at 30-60 dB noise, WhisperMask maintained a high performance even without denoising, surpassing the other microphones' performances by a significant margin. WhisperMask's design renders the wearer's voice as the dominant input and effectively suppresses background noise without relying on signal processing. This device allows for reliable voice interactions, such as phone calls and voice commands, in a wide range of noisy real-world scenarios while preserving user privacy.

Paper Structure (45 sections, 9 figures, 1 table)

This paper contains 45 sections, 9 figures, 1 table.

Figures (9)

  • Figure 2: Overview of WhisperMask as a masked microphone. WhisperMask is a mask-type microphone that allows for hands-free, non-obtrusive input (left). The microphone is sandwiched between the fabric of two non-woven masks (center). The diaphragm is connected to a microcontroller and can be used on a PC or smartphone via USB or Bluetooth (right).
  • Figure 3: Swept-sine signal for measuring the impulse response. Five swept-sine signals of length 65536 are generated (left). A dummy head with a voice simulator is fitted with the microphones for measurement, and the speaker is placed 500 mm from the dummy head (upper right). The impulse response is calculated by convolving the recorded signal, preprocessed to detect the swept-sine segments, with the inverse filter of the swept-sine signal (lower right).
  • Figure 4:
  • Figure 8: SNR result for different microphones at 60 dB input
  • Figure 9: The WebUI used to evaluate the quality of the recorded audio clips; the metrics were based on MUSHRA (ITU-R). The participants rated each audio clip on a scale of 0 to 100. The reference audio clip (top row) had high quality and was used as a criterion for selecting responses. Four of the five test items are recordings captured in a noisy environment by each device. The remaining test item is identical to the reference; any participant who rated it as lower in quality than the reference was judged to be a less reliable respondent.
  • ...and 4 more figures
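
The swept-sine measurement described in the Figure 3 caption can be illustrated with a small simulation. This is a minimal sketch and not the authors' implementation: it assumes a time-stretched-pulse style sweep defined in the frequency domain, a stretch parameter `m` of one quarter of the signal length, periodic playback (so the system acts as a circular convolution), and a made-up three-tap "true" impulse response. The function names `make_swept_sine` and `estimate_impulse_response` are hypothetical. Only the signal length (65536) and the five repetitions come from the caption.

```python
import numpy as np


def make_swept_sine(n=65536, m=None):
    """Build a length-n swept sine and the spectrum of its inverse filter.

    Assumption: unit-magnitude spectrum with quadratic phase; the stretch
    parameter m is a free choice, not taken from the paper.
    """
    if m is None:
        m = n // 4
    k = np.arange(n // 2 + 1)
    spectrum = np.exp(-1j * 4.0 * np.pi * m * k**2 / n**2)
    sweep = np.fft.irfft(spectrum, n)
    # The spectrum has unit magnitude, so its inverse is its complex conjugate.
    inverse_spectrum = np.conj(spectrum)
    return sweep, inverse_spectrum


def estimate_impulse_response(recorded_periods, inverse_spectrum):
    """Average the recorded sweep periods, then apply the inverse filter.

    Multiplying spectra is equivalent to circular convolution in time.
    """
    n = recorded_periods.shape[-1]
    averaged = recorded_periods.mean(axis=0)
    return np.fft.irfft(np.fft.rfft(averaged) * inverse_spectrum, n)


# Simulated measurement with a hypothetical three-tap system.
n = 65536
sweep, inverse_spectrum = make_swept_sine(n)

true_ir = np.zeros(n)
true_ir[[0, 40, 200]] = [1.0, 0.5, -0.25]

# System output for one sweep period (circular convolution with true_ir).
clean = np.fft.irfft(np.fft.rfft(sweep) * np.fft.rfft(true_ir), n)

# Five noisy repetitions, mirroring the five sweeps in the caption.
rng = np.random.default_rng(0)
periods = clean + 1e-3 * rng.standard_normal((5, n))

noiseless = estimate_impulse_response(clean[np.newaxis, :], inverse_spectrum)
estimate = estimate_impulse_response(periods, inverse_spectrum)
print("max error, noiseless:", np.max(np.abs(noiseless - true_ir)))
print("max error, 5 noisy periods:", np.max(np.abs(estimate - true_ir)))
```

Averaging the repetitions before deconvolution reduces uncorrelated noise, which is one plausible reason for playing several sweeps. A real recording would first need the sweep segments located and aligned, which is the preprocessing step the caption mentions and which this sketch skips.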