We Can Hear You with mmWave Radar! An End-to-End Eavesdropping System

Dachao Han, Teng Huang, Han Ding, Cui Zhao, Fei Wang, Ge Wang, Wei Xi

TL;DR

mmSpeech demonstrates a speech privacy risk: it reconstructs intelligible speech end to end from loudspeaker-induced surface vibrations sensed by mmWave radar, even through walls and without prior knowledge of the speaker. It identifies PET film as an optimal vibrating medium and tunes the radar sampling rate to capture speech content below 4 kHz, then refines the signal with a GAN-based network that uses spectrum denoising and multi-resolution Mel losses. The system achieves state-of-the-art quality (e.g., FWSegSNR ≈ 9.43 dB, MCD ≈ 5.18, MEL ≈ 2.09 on seen data) and generalizes to unseen speakers and conditions. Synthetic data generation and selective fine-tuning of the ASR encoder further improve transcription accuracy. The work highlights a practical privacy threat, suggests defense directions (damping, noise injection, strategic placement), and provides an evaluation framework for mmWave-based vibration eavesdropping in through-wall scenarios.
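To make the MCD figure quoted above more concrete, here is a minimal sketch of the commonly used Mel Cepstral Distortion definition. It is an illustration under stated assumptions, not necessarily the exact variant the authors evaluated: the function name `mel_cepstral_distortion` is hypothetical, the inputs are assumed to be time-aligned mel-cepstral arrays of shape (frames, coefficients), and column 0 is assumed to hold the energy coefficient, which is excluded.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, est_mcep):
    """Mean MCD in dB between two time-aligned (frames, coeffs) arrays.

    Assumption: column 0 is the energy coefficient and is excluded,
    as is common practice; the paper's exact variant is not stated.
    """
    ref = np.asarray(ref_mcep, dtype=float)[:, 1:]
    est = np.asarray(est_mcep, dtype=float)[:, 1:]
    if ref.shape != est.shape:
        raise ValueError("inputs must be time-aligned with equal shapes")
    # Per-frame Euclidean distance, scaled by sqrt(2) as in the usual definition
    per_frame = np.sqrt(2.0 * np.sum((ref - est) ** 2, axis=1))
    # Convert from natural-log cepstral units to decibels and average over frames
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```

Lower values indicate that the reconstructed speech is spectrally closer to the reference; identical inputs give 0 dB.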

Abstract

With the rise of voice-enabled technologies, loudspeaker playback has become widespread, posing increasing risks to speech privacy. Traditional eavesdropping methods often require invasive access or line-of-sight, limiting their practicality. In this paper, we present mmSpeech, an end-to-end mmWave-based eavesdropping system that reconstructs intelligible speech solely from vibration signals induced by loudspeaker playback, even through walls and without prior knowledge of the speaker. To achieve this, we reveal an optimal combination of vibrating material and radar sampling rate for capturing high-quality vibrations using narrowband mmWave signals. We then design a deep neural network that reconstructs intelligible speech from the estimated noisy spectrograms. To further support downstream speech understanding, we introduce a synthetic training pipeline and selectively fine-tune the encoder of a pre-trained ASR model. We implement mmSpeech with a commercial mmWave radar and validate its performance through extensive experiments. Results show that mmSpeech achieves state-of-the-art speech quality and generalizes well across unseen speakers and various conditions.
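The sensing principle behind this kind of system can be sketched with two standard relations: the round-trip phase of a reflected radar signal changes by 4π·Δd/λ for a surface displacement Δd, and the vibration must be sampled at no less than twice the highest frequency of interest (Nyquist). The snippet below is only an illustration; the function names, the 77 GHz carrier, the 1 µm amplitude, and the 10 kHz sampling rate are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def phase_to_displacement(phase_rad, carrier_hz):
    """Convert radar phase (radians) to surface displacement (metres).

    Uses the round-trip relation delta_phi = 4 * pi * delta_d / wavelength.
    """
    wavelength = C / carrier_hz
    return np.unwrap(phase_rad) * wavelength / (4.0 * np.pi)

def min_sampling_rate(max_freq_hz):
    """Nyquist lower bound on the vibration sampling rate (Hz)."""
    return 2.0 * max_freq_hz

# Hypothetical setup: 77 GHz carrier, 1 micrometre vibration at 440 Hz
carrier_hz = 77e9
fs = 10_000.0  # assumed sampling rate, above the 8 kHz bound for 4 kHz content
t = np.arange(0, 0.01, 1.0 / fs)
true_disp = 1e-6 * np.sin(2 * np.pi * 440.0 * t)

# Phase the radar would observe for that displacement, then invert it
phase = 4.0 * np.pi * true_disp / (C / carrier_hz)
recovered = phase_to_displacement(phase, carrier_hz)
```

Because millimetre wavelengths are only a few millimetres long, micrometre-scale vibrations still produce a measurable phase change, which is what makes this sensing approach feasible.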

Paper Structure

This paper contains 53 sections, 9 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: An eavesdropper equipped with a portable mmWave radar can covertly reconstruct private speech occurring inside a soundproof room. By capturing subtle vibration signals induced by loudspeaker playback, the attacker can not only recover intelligible speech but also transcribe its content using an ASR system, posing a serious threat to speech privacy even in acoustically isolated environments.
  • Figure 2: The mmSpeech system consists of two main components: mmWave-based vibration sensing and DNN-based speech reconstruction.
  • Figure 3: Preprocessing of the mmWave radar estimated vibration signals.
  • Figure 4: Frequency response of different vibrating materials.
  • Figure 5: Frequency response vs sampling rate.
  • ...and 11 more figures