PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Govind Mittal, Arthur Jakobsson, Kelly O. Marshall, Chinmay Hegde, Nasir Memon

TL;DR

PITCH is an AI-assisted framework for tagging real-time audio deepfakes. It uses a taxonomy of in-call challenges that degrade deepfake quality and make detection easier. The paper shows that 10 robust challenges raise machine AUROC to 88.7% and that combining machine predictions with human input yields 84.5% detection accuracy while preserving human decision authority. The study provides a large, open dataset (18,600 originals; 1.6M deepfakes across 100 speakers) and a practical, enrollment-free approach to authenticating calls through human-AI collaboration. These results support deploying PITCH as an adaptable pre-screener against real-time voice cloning in finance, government, and healthcare calls, with careful attention to usability and accessibility.

Abstract

The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. This technology represents an existential threat to phone-based authentication systems, and total identity fraud losses have reached $43 billion. Unlike traditional robocalls, these personalized AI-generated voice attacks target high-value accounts and circumvent existing defensive measures, creating an urgent cybersecurity challenge. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. Testing against leading voice-cloning systems using a novel dataset (18,600 original and 1.6 million deepfake samples from 100 users), PITCH's challenges enhanced machine detection capabilities to an 88.7% AUROC score, enabling us to identify 10 highly effective challenges. For human evaluation, we filtered a challenging, balanced subset on which human evaluators independently achieved 72.6% accuracy, while machines scored 87.7%. Recognizing that call environments require human control, we developed a novel human-AI collaborative system that tags suspicious calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages, giving users maximum control while boosting detection accuracy to 84.5%. This significant improvement positions PITCH as a potential AI-assisted pre-screener for verifying calls, offering an adaptable approach to combat real-time voice-cloning attacks while maintaining human decision authority.

Paper Structure

This paper contains 19 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of PITCH: (1) A company representative requests an audio challenge from a suspicious caller. (2) The caller attempts the challenge. (3) Machine analysis provides a prediction and confidence level. (4) High-confidence machine predictions are accepted; low-confidence cases are sent for manual review. (5) The final decision is made to either accept the call or reject it and notify the genuine customer. Feedback provides further instructions to the representative.
  • Figure 2: Comparison of degradation scores between machine (left) and human (right) evaluations across top-10 challenges. Both panels display boxplots for fake (blue) and original (orange) audio samples with AUROC percentages shown above. Left: Machine-scored degradation arranged by increasing median values of fake samples. Right: Human-scored degradation arranged by increasing AUROC. Higher scores ($\uparrow$) indicate greater degradation and better challenge performance. The different ordering between panels reveals complementary strengths between machine and human detection capabilities. Full version: Fig. \ref{fig:human_boxplot}.
  • Figure 3: Machine-assisted evaluation process: (a) Humans make an initial decision, (b) Machine verdict is shown, (c) Humans update their decision and confidence.
  • Figure 4: Tradeoff between deepfake detection accuracy and human decision retention. AI overrides human decisions when its confidence is higher. Temperature calibrates AI confidence, with lower values increasing AI overrides. The intersection point represents equilibrium between human control and AI automation.
  • Figure 5: 100 users' willingness rated on a 5-point Likert scale.
  • ...and 4 more figures
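
The override rule described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes a binary detector that outputs a single logit (positive meaning "fake"), a sigmoid with temperature scaling as the calibration step, and a human confidence expressed on a 0 to 1 scale. The function names `ai_confidence` and `final_decision` are hypothetical.

```python
import math


def ai_confidence(logit: float, temperature: float) -> float:
    """Confidence of the AI in its own predicted class after temperature scaling.

    Assumption: the detector emits one logit, where positive values mean "fake".
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    p_fake = 1.0 / (1.0 + math.exp(-logit / temperature))
    # Confidence is the probability assigned to whichever class is predicted.
    return max(p_fake, 1.0 - p_fake)


def final_decision(human_label: str, human_conf: float,
                   logit: float, temperature: float) -> tuple[str, str]:
    """Return (label, source); the AI verdict is used only when it is more confident."""
    ai_label = "fake" if logit > 0 else "real"
    if ai_confidence(logit, temperature) > human_conf:
        return ai_label, "ai"
    return human_label, "human"


# Same call, same human verdict, two temperatures.
print(final_decision("real", 0.8, logit=1.0, temperature=1.0))
print(final_decision("real", 0.8, logit=1.0, temperature=0.5))
```

With a temperature of 1.0 the AI confidence is about 0.73, which is below the human's 0.8, so the human decision is retained. Lowering the temperature to 0.5 raises the AI confidence to about 0.88 and the AI verdict takes over, which matches the caption's statement that lower temperatures increase AI overrides.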