Human Perception of Audio Deepfakes

Nicolas M. Müller, Karla Pizzi, Jennifer Williams

TL;DR

This paper tackles the problem of human versus AI detection of audio deepfakes by deploying a gamified online experiment where 472 participants compete against a state-of-the-art detector trained on the ASVspoof 2019 eval set, across 14,912 rounds. It demonstrates that, in realistic conditions, AI and humans exhibit similar strengths and weaknesses and no group achieves superhuman performance; however, a naive AI detector can outperform humans by exploiting data artifacts. The study also reveals modest native-language advantages and age-related declines in detection, with IT experience having little effect, highlighting the need for multilingual datasets and user-focused training to strengthen defenses against audio deepfakes. These findings inform both the development of more robust detection algorithms and practical cybersecurity training programs.

Abstract

The recent emergence of deepfakes has brought manipulated and generated content to the forefront of machine learning research. Automatic detection of deepfakes has seen many new machine learning techniques; however, human detection capabilities are far less explored. In this paper, we present results from comparing the abilities of humans and machines to detect audio deepfakes used to imitate someone's voice. For this, we use a web-based application framework formulated as a game. Participants were asked to distinguish between real and fake audio samples. In our experiment, 472 unique users competed against a state-of-the-art AI deepfake detection algorithm for a total of 14,912 rounds of the game. We find that humans and deepfake detection algorithms share similar strengths and weaknesses, both struggling to detect certain types of attacks. This is in contrast to the superhuman performance of AI in many application areas such as object detection or face recognition. Concerning human success factors, we find that IT professionals have no advantage over non-professionals, but native speakers have an advantage over non-native speakers. Additionally, we find that older participants tend to be more susceptible than younger ones. These insights may be helpful when designing future cybersecurity training for humans as well as developing better detection algorithms.

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: The web interface as presented to the users. The user is required to listen to an audio file (as often as they like) and then classify the audio via the 'Fake!' or 'Authentic!' button. After classifying, the true label is shown to the user along with the AI algorithm prediction for the gamification approach. Available at https://deepfake-total.com/spot_the_deepfake/.
  • Figure 2: Age distribution of participants who were included in our analysis of the game.
  • Figure 3: Average accuracy per attack. This graph shows the mean accuracy per attack ID (with '-' denoting no attack, i.e., bonafide samples). The blue bars indicate the accuracy of the human players; the green bars indicate the accuracy of the AI (RawNet2). The absolute difference is shown in red. The difference between human players and the AI algorithm is small, at 9.876 on average.
  • Figure 4: Mean user accuracy vs. system architecture of the audio spoof system. TTS-based systems (tts) clearly outperform voice conversion (vc) and waveform concatenation (concat).
  • Figure 5: Human detection accuracy grouped by the level of IT expertise (1 -- little knowledge; 5 -- expert knowledge). There is no significant correlation between the level of expertise and the ability to detect audio deepfakes.
  • ...and 3 more figures
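
The per-attack comparison described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration only, not the authors' analysis code: the function names `accuracy_per_attack` and `mean_abs_difference`, the input layout (one tuple per game round), and the toy rounds are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_per_attack(rounds):
    """Group game rounds by attack ID and compute accuracies.

    `rounds` is an iterable of (attack_id, human_correct, ai_correct)
    tuples. This layout is a hypothetical one chosen for the sketch.
    Returns {attack_id: (human_acc, ai_acc, abs_difference)}.
    """
    # Per attack ID: [number of rounds, human hits, AI hits]
    counts = defaultdict(lambda: [0, 0, 0])
    for attack_id, human_correct, ai_correct in rounds:
        entry = counts[attack_id]
        entry[0] += 1
        entry[1] += int(human_correct)
        entry[2] += int(ai_correct)

    result = {}
    for attack_id, (n, human_hits, ai_hits) in counts.items():
        human_acc = human_hits / n
        ai_acc = ai_hits / n
        result[attack_id] = (human_acc, ai_acc, abs(human_acc - ai_acc))
    return result


def mean_abs_difference(per_attack):
    """Average the absolute human-vs-AI accuracy gap over all attack IDs."""
    gaps = [gap for _, _, gap in per_attack.values()]
    return sum(gaps) / len(gaps)


# Toy data (invented for illustration): '-' marks bonafide samples,
# as in the caption; 'A10' stands in for an arbitrary attack ID.
toy_rounds = [
    ("-", True, True),
    ("-", False, True),
    ("A10", True, False),
    ("A10", True, True),
]
per_attack = accuracy_per_attack(toy_rounds)
average_gap = mean_abs_difference(per_attack)
```

Averaging the gap over attack IDs rather than over individual rounds gives every attack equal weight regardless of how often it was played; whether the paper's reported average uses this weighting is not stated in the caption.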