How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines

Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung

TL;DR

Problem: assessing how well low-frequency speech recordings protect verbal privacy in real-world settings while still enabling analysis of social dynamics. Approach: the authors down-sample speech audio from several datasets and measure frame error rate (FER), word error rate (WER), and extended short-time objective intelligibility (eSTOI); they then test privacy risks via bandwidth-extension (BWE) attacks using neural BWE models trained on 16 kHz (VCTK) and REWIND data. Key findings: practical privacy-preserving thresholds lie around 800 Hz for VAD and 2000 Hz for blocking intelligible content; bandwidth extension can recover some information (primarily stop-words), but human intelligibility remains limited; privacy is not absolute against advanced attacks. Significance: the findings guide the design of privacy-conscious wearables and motivate robust defenses and attack-aware evaluation for real-world speech privacy.

Abstract

Low-frequency audio has been proposed as a promising privacy-preserving modality to study social dynamics in real-world settings. To this end, researchers have developed wearable devices that can record audio at frequencies as low as 1250 Hz to mitigate the automatic extraction of the verbal content of speech that may contain private details. This paper investigates the validity of this hypothesis, examining the degree to which low-frequency speech ensures verbal privacy. It includes simulating a potential privacy attack in various noise environments. Further, it explores the trade-off between the performance of voice activity detection, which is fundamental for understanding social behavior, and privacy-preservation. The evaluation incorporates subjective human intelligibility and automatic speech recognition performance, comprehensively analyzing the delicate balance between effective social behavior analysis and preserving verbal privacy.
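
The automatic speech recognition side of such an evaluation is commonly summarized as word error rate (WER). The following is a minimal sketch of the conventional definition (word-level edit distance divided by the number of reference words), not the authors' evaluation code; the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example strings are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercased, whitespace-split tokens; real evaluations
    usually apply a more careful text normalization first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair: two of six reference words are dropped.
example = word_error_rate("please call stella and ask her",
                          "please call and her")
```

Under this definition a higher WER means the recognizer recovered less of the verbal content, so from a privacy standpoint a high WER on low-frequency audio is the desired outcome. WER can exceed 1.0 when the hypothesis contains many inserted words.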

Paper Structure (12 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of the study. From datasets with and without mingle setting (Section 3.1), we process the audio samples into low-frequency speech audio (Section 3.2) and bandwidth-extended low-frequency speech audio (Section 3.3).
  • Figure 2: Performance (means and standard deviations) of rVAD at different sample rates compared with the original sample rates.
  • Figure 3: Performance of Whisper at different frequencies compared with the ground-truth transcripts, and speech intelligibility predicted by eSTOI at different frequencies compared with the original speech signals.
  • Figure 4: Performance of ASR with and without BWE on Pop-glass and VCTK audio, respectively, at sample rates of 800, 1250, and 2000 Hz, compared with the ground-truth transcripts.
  • Figure 5: Mean and standard deviation of Q1 and Q2
  • ...and 1 more figures