Human Perception of Audio Deepfakes

Nicolas M. Müller; Karla Pizzi; Jennifer Williams

TL;DR

This paper compares human and AI detection of audio deepfakes through a gamified online experiment in which 472 participants competed against a state-of-the-art detector trained on the ASVspoof 2019 eval set, across 14,912 rounds. It shows that, under realistic conditions, humans and AI exhibit similar strengths and weaknesses and the AI does not reach superhuman performance; however, a naive AI detector can outperform humans by exploiting data artifacts. The study also finds a modest native-language advantage and an age-related decline in detection, while IT experience has little effect, highlighting the need for multilingual datasets and user-focused training to strengthen defenses against audio deepfakes. These findings inform both the development of more robust detection algorithms and practical cybersecurity training programs.

Abstract

The recent emergence of deepfakes has brought manipulated and generated content to the forefront of machine learning research. Automatic detection of deepfakes has seen many new machine learning techniques; however, human detection capabilities are far less explored. In this paper, we present results from comparing the abilities of humans and machines for detecting audio deepfakes used to imitate someone's voice. For this, we use a web-based application framework formulated as a game. Participants were asked to distinguish between real and fake audio samples. In our experiment, 472 unique users competed against a state-of-the-art AI deepfake detection algorithm for a total of 14,912 rounds of the game. We find that humans and deepfake detection algorithms share similar strengths and weaknesses, both struggling to detect certain types of attacks. This is in contrast to the superhuman performance of AI in many application areas such as object detection or face recognition. Concerning human success factors, we find that IT professionals have no advantage over non-professionals, but native speakers have an advantage over non-native speakers. Additionally, we find that older participants tend to be more susceptible than younger ones. These insights may be helpful when designing future cybersecurity training for humans as well as developing better detection algorithms.
