Who Speaks What from Afar: Eavesdropping In-Person Conversations via mmWave Sensing
Shaoying Wang, Hansong Zhou, Yukun Yuan, Xiaonan Zhang
TL;DR
The paper tackles the privacy risk of eavesdropping on in-person meetings using mmWave sensing and the challenge of attributing speech to the correct speaker. It introduces a four-module, unsupervised pipeline that leverages multiple objects to capture distinct vibration signatures, calibrates out static interference, and fuses multi-object information for speaker distinction and speech enhancement. Through real-room experiments and live-speech tests, it demonstrates up to 0.99 accuracy in speaker attribution and robust speech recovery across various object configurations and distances, underscoring significant privacy implications for shared environments. The work highlights practical attack feasibility and motivates developing defenses against through-wall mmWave-based eavesdropping systems.
Abstract
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question of "who speaks what". By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach that distinguishes participants by detecting speech-induced vibration differences in the frequency domain. In addition, we explore a deep learning-based framework that combines signals from multiple objects to enhance speech quality. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack achieves a speech classification accuracy of up to 0.99 with several participants in a meeting room. Our attack also demonstrates consistent speech quality enhancement across all tested real-world scenarios, including different distances between the radar and the objects.
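The idea that differently seated speakers induce distinct vibration patterns on nearby objects can be illustrated with a small synthetic sketch. It is not the system described above: the signals are simulated noise rather than radar measurements, the per-object gains are hypothetical, the feature (a per-band log energy ratio between two objects) is one simple choice of frequency-domain difference, and the clustering is plain 2-means with the number of speakers assumed known, whereas the attack itself does not require that knowledge.

```python
import numpy as np

def band_log_ratio(sig_a, sig_b, n_bands=8):
    """Per-band log energy ratio between two objects' vibration signals."""
    power_a = np.abs(np.fft.rfft(sig_a)) ** 2
    power_b = np.abs(np.fft.rfft(sig_b)) ** 2
    energy_a = np.array([band.sum() for band in np.array_split(power_a, n_bands)])
    energy_b = np.array([band.sum() for band in np.array_split(power_b, n_bands)])
    # Small constant avoids log(0) on silent bands.
    return np.log(energy_a + 1e-12) - np.log(energy_b + 1e-12)

def two_means(features, n_iter=20):
    """Plain 2-means clustering; k=2 is an assumption of this sketch."""
    first = features[0]
    # Initialise the second center at the point farthest from the first.
    farthest = features[np.argmax(np.linalg.norm(features - first, axis=1))]
    centers = np.stack([first, farthest])
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels

# Synthetic stand-in for the scene: two speakers, two objects.
# gains[speaker, object] is hypothetical: each speaker sits closer to one object.
rng = np.random.default_rng(0)
gains = np.array([[1.0, 0.3], [0.3, 1.0]])
truth = np.tile([0, 1], 20)  # alternating speakers, 40 segments

features = []
for speaker in truth:
    source = rng.standard_normal(1024)  # placeholder for one speech segment
    obj_a = gains[speaker, 0] * source + 0.05 * rng.standard_normal(1024)
    obj_b = gains[speaker, 1] * source + 0.05 * rng.standard_normal(1024)
    features.append(band_log_ratio(obj_a, obj_b))
features = np.array(features)

labels = two_means(features)
# Cluster indices are arbitrary, so score up to a label swap.
agreement = np.mean(labels == truth)
accuracy = max(agreement, 1.0 - agreement)
print(f"attribution accuracy on synthetic segments: {accuracy:.2f}")
```

The ratio between objects, rather than the spectrum of a single object, is used here because it cancels the speech content common to both objects and leaves mostly the speaker-position-dependent difference. That design choice belongs to the sketch and is not claimed for the paper's method.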
