PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali

TL;DR

PhonemeFake addresses the gap between realistic deepfake (DF) threats and existing detectors with a language-driven, segmental attack that alters semantically critical speech regions while preserving the surrounding content. The work contributes the PhonemeFake (PF) generation pipeline and a fast bilevel detection model (PFD) that combines LF and HF streams with a Gumbel-Softmax gate to localize manipulations with high precision at low compute cost. PF lowers human perception accuracy by up to $42\%$ and benchmark accuracies by up to $94\%$, while PFD achieves up to $91\%$ reduction in equal error rate (EER) and $\sim90\%$ inference speed-up across datasets. Together these provide a realistic benchmark and a scalable, high-resolution detection framework, with publicly released PF data and open-source code for further research and defense development.

Abstract

Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, reducing human perception accuracy by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source a bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets show that our detection model reduces EER by 91% while achieving up to 90% speed-up, offering a scalable solution with minimal compute overhead and more precise localization than existing models.

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: What would be a more realistic DF attack and how does language reasoning help DF synthesis? Our pipeline for generating segmentally manipulated DF audio samples integrates LM reasoning to emulate plausible and realistic DF attacks. Starting from a bona fide audio recording, the transcription and timing are extracted using Whisper, and an LM identifies a specific target word or phrase for manipulation. The manipulation process employs one of three strategies: inversion, where a word is replaced with its antonym; insertion, where a negating or altering phrase is added; and deletion, where a critical word is omitted. The modified text is then synthesized using a TTS model to generate the manipulated utterance. The resulting DF retains the original flow, making detection challenging while reflecting real attack scenarios.
  • Figure 2: Do Current DF Datasets Deceive Humans? Perception accuracy on 93 samples from 78 participants shows PhonemeFake is harder to detect than existing DF datasets.
  • Figure 3: How can we efficiently detect fine-grained DF manipulations? Our bilevel detection model first uses an LF stream to identify regions of interest (RoI), then selectively activates an HF stream via Gumbel-Softmax gating for fine-grained analysis, ensuring high accuracy and resolution with minimal compute overhead.
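
The gating idea in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal, inference-only illustration and not the authors' PFD implementation: `lf_score` and `hf_score` are hypothetical stand-ins for the LF and HF streams, the two-way [skip, refine] gate, the clamping constant, and the temperature `tau` are assumptions, and the straight-through gradient normally used when training with a hard Gumbel-Softmax sample is omitted.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=True, rng=random):
    """Draw a (relaxed) one-hot sample over `logits` via the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]
    if not hard:
        return probs
    # Hard variant: snap the relaxed sample to a one-hot vector.
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(probs))]


def bilevel_detect(segments, lf_score, hf_score, tau=0.5, rng=random):
    """Score every segment cheaply; run the costly scorer only where the gate opens.

    lf_score / hf_score are hypothetical callables returning a probability
    that a segment is manipulated (assumed interface, not from the paper).
    Returns a list of (segment, score, refined) tuples.
    """
    results = []
    for seg in segments:
        # Coarse pass: cheap estimate, clamped away from 0 and 1 for the logs.
        p = min(max(lf_score(seg), 1e-6), 1.0 - 1e-6)
        # Two-way gate over [skip, refine], driven by the coarse estimate.
        gate = gumbel_softmax([math.log(1.0 - p), math.log(p)], tau=tau, rng=rng)
        if gate[1] == 1.0:
            # Gate open: spend compute on the fine-grained pass.
            results.append((seg, hf_score(seg), True))
        else:
            # Gate closed: keep the coarse score and skip the fine pass.
            results.append((seg, p, False))
    return results


if __name__ == "__main__":
    # Toy run: segment 2 looks suspicious to the coarse scorer.
    coarse = {0: 0.02, 1: 0.05, 2: 0.97, 3: 0.03}
    for seg, score, refined in bilevel_detect(
        list(coarse), coarse.get, lambda s: 0.99, rng=random.Random(0)
    ):
        print(seg, round(score, 3), "refined" if refined else "coarse only")
```

In this sketch the expensive scorer runs only on segments whose gate sample selects "refine", which is where the compute saving would come from; how the real model forms its gate logits and segments the audio is not specified in the text above.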