Who Speaks What from Afar: Eavesdropping In-Person Conversations via mmWave Sensing

Shaoying Wang, Hansong Zhou, Yukun Yuan, Xiaonan Zhang

TL;DR

The paper tackles the privacy risk of eavesdropping on in-person meetings using mmWave sensing and the challenge of attributing speech to the correct speaker. It introduces a four-module, unsupervised pipeline that leverages multiple objects to capture distinct vibration signatures, calibrates out static interference, and fuses multi-object information for speaker distinction and speech enhancement. Through real-room experiments and live-speech tests, it demonstrates up to 0.99 accuracy in speaker attribution and robust speech recovery across various object configurations and distances, underscoring significant privacy implications for shared environments. The work highlights practical attack feasibility and motivates developing defenses against through-wall mmWave-based eavesdropping systems.

Abstract

Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question "who speaks what". By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to 0.99 with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
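
The abstract does not specify the features or the clustering procedure, so the following is only a toy sketch of the general idea of grouping segments by frequency-domain differences without knowing the number of participants in advance. Everything in it is an assumption made for illustration: the function names (`simulate_segment`, `band_energy_features`, `leader_cluster`), the band-energy features, the threshold-based leader clustering, and the synthetic "low" and "high" spectral tilts that stand in for two seating positions. It is not the paper's method.

```python
import numpy as np


def simulate_segment(rng, tilt, n=2048):
    """Hypothetical stand-in for a vibration segment: white noise shaped by
    a position-dependent frequency response ("low" or "high" tilt)."""
    spec = np.fft.rfft(rng.standard_normal(n))
    x = np.linspace(0.0, 1.0, spec.size)  # normalized frequency axis
    gain = np.exp(-3.0 * x) if tilt == "low" else np.exp(-3.0 * (1.0 - x))
    return np.fft.irfft(spec * gain, n=n)


def band_energy_features(segment, n_bands=16):
    """Log energy per frequency band, with the mean removed so that
    overall loudness does not affect the comparison."""
    windowed = segment * np.hanning(len(segment))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    bands = np.array_split(power, n_bands)
    log_energy = np.log(np.array([b.sum() for b in bands]) + 1e-12)
    return log_energy - log_energy.mean()


def leader_cluster(features, threshold):
    """Assign each feature vector to the nearest centroid if it is closer
    than `threshold`; otherwise open a new cluster. The number of clusters
    is therefore not fixed in advance."""
    centroids, counts, labels = [], [], []
    for f in features:
        best, best_dist = -1, np.inf
        for j, c in enumerate(centroids):
            dist = float(np.linalg.norm(f - c))
            if dist < best_dist:
                best, best_dist = j, dist
        if best_dist < threshold:
            counts[best] += 1
            # Running mean update of the matched centroid.
            centroids[best] = centroids[best] + (f - centroids[best]) / counts[best]
            labels.append(best)
        else:
            centroids.append(np.array(f, dtype=float))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    turns = ["low", "high", "low", "high", "low", "high"]
    feats = [band_energy_features(simulate_segment(rng, t)) for t in turns]
    print(leader_cluster(feats, threshold=4.0))
```

The threshold of 4.0 is chosen by hand for this synthetic data; with real radar signals it would have to be tuned or replaced by a more principled criterion.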

Paper Structure

This paper contains 33 sections, 14 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: Illustration of our attack scenario.
  • Figure 2: Illustration of sound pressure-induced surface vibration.
  • Figure 3: Illustration of frequency responses of speech-induced vibrations on objects with a speaker changing position slightly.
  • Figure 4: Static interference verification.
  • Figure 5: System overview.
  • ...and 20 more figures