Exploring Audio Editing Features as User-Centric Privacy Defenses Against Large Language Model (LLM) Based Emotion Inference Attacks
Mohd. Farhan Israk Soumik, W. K. M. Mithsara, Abdur R. Shahid, Ahmed Imteaj
TL;DR
The paper addresses the privacy risks of emotion inference from speech in contemporary audio-enabled systems and proposes a user-centric defense that uses pitch and tempo manipulation via familiar editing apps. It formalizes a threat model that includes DNNs and LLMs and demonstrates the approach with a lightweight on-device mechanism evaluated on three emotion datasets. Results show that pitch/tempo perturbations significantly degrade emotion inference by both traditional CNNs and large language models, while maintaining usability and cross-platform applicability. The work lays groundwork for practical, user-friendly on-device privacy tools and outlines future directions including more attackers, integration of state-of-the-art SER models, and data-recovery mechanisms for authorized users.
Abstract
The rapid proliferation of speech-enabled technologies, including virtual assistants, video conferencing platforms, and wearable devices, has raised significant privacy concerns, particularly regarding the inference of sensitive emotional information from audio data. Existing privacy-preserving methods often compromise usability and security, limiting their adoption in practical scenarios. This paper introduces a novel, user-centric approach that leverages familiar audio editing techniques, specifically pitch and tempo manipulation, to protect emotional privacy without sacrificing usability. By analyzing popular audio editing applications on Android and iOS platforms, we identified these features as both widely available and usable. We rigorously evaluated their effectiveness against a threat model that considers adversarial attacks from diverse sources, including Deep Neural Networks (DNNs) and Large Language Models (LLMs), as well as reversibility testing. Our experiments, conducted on three distinct datasets, demonstrate that pitch and tempo manipulation effectively obfuscates emotional data. Additionally, we explore the design principles for lightweight, on-device implementation to ensure broad applicability across various devices and platforms.
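To make the idea of pitch and tempo manipulation concrete, the sketch below shows the simplest form of such an edit: resampling by linear interpolation, which speeds up or slows down a signal and shifts its pitch by the same factor. This is an illustrative stand-in, not the mechanism evaluated in the paper. The function names (`change_speed`, `semitones_to_factor`, `sine`) are hypothetical, and audio editing apps generally expose pitch and tempo as separate controls, which requires more elaborate algorithms than plain resampling.

```python
import math


def semitones_to_factor(n_steps):
    """Convert a pitch shift in semitones to a frequency ratio (12 semitones = 2x)."""
    return 2.0 ** (n_steps / 12.0)


def change_speed(samples, factor):
    """Resample a mono signal by linear interpolation.

    factor > 1 shortens the signal (faster tempo, higher pitch);
    factor < 1 lengthens it (slower tempo, lower pitch).
    Assumption: pitch and tempo are coupled here, unlike the separate
    pitch and tempo controls found in editing apps.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def sine(freq_hz, duration_s, sample_rate):
    """Generate a test tone as a list of floats in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


# Example: a 220 Hz tone played back four semitones higher (and correspondingly faster).
tone = sine(220.0, 1.0, 8000)
shifted = change_speed(tone, semitones_to_factor(4))
```

In this simplified setting a single factor controls the strength of the perturbation. A defense along the lines described in the abstract would instead choose pitch and tempo settings separately and check how far each can be pushed before the speech becomes hard to use.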
