Study of detecting behavioral signatures within DeepFake videos

Qiaomu Miao, Sinhwa Kang, Stacy Marsella, Steve DiPaola, Chao Wang, Ari Shapiro

TL;DR

This study investigates whether behavioral signatures in talking-face DeepFakes can reveal that a video is synthetic even when its visuals are highly realistic. By combining lip-sync and facial reenactment (Wav2Lip and First Order Motion Model), the authors generate appearance-preserving DeepFakes driven by different sources and utterances, and gather human judgments on naturalness and engagement. Across three tests, humans preferred real videos over synthetic ones and identified specific nonverbal cues (particularly mouth movements and facial expressions) as influential, while head movements affected utterance-behavior congruence. The results suggest that behavioral signatures and their alignment with utterances are valuable cues for DeepFake detection, motivating detection models to incorporate dynamic behavior alongside visual quality to remain robust against future advances in video synthesis.

Abstract

There is strong interest in the generation of synthetic video imagery of people talking for various purposes, including entertainment, communication, training, and advertisement. With the development of deep fake generation models, synthetic video imagery will soon be visually indistinguishable to the naked eye from a naturally captured video. In addition, many methods continue to improve at evading more careful, forensic visual analysis. Some deep fake videos are produced through the use of facial puppetry, which directly controls the head and face of the synthetic image through the movements of the actor, allowing the actor to 'puppet' the image of another. In this paper, we address the question of whether one person's movements can be distinguished from the original speaker by controlling the visual appearance of the speaker but transferring the behavior signals from another source. We conduct a study by comparing synthetic imagery that: 1) originates from a different person speaking a different utterance, 2) originates from the same person speaking a different utterance, and 3) originates from a different person speaking the same utterance. Our study shows that synthetic videos in all three cases are seen as less real and less engaging than the original source video. Our results indicate that there could be a behavioral signature that is detectable from a person's movements that is separate from their visual appearance, and that this behavioral signature could be used to distinguish a deep fake from a properly captured video.

Paper Structure (17 sections, 6 figures)

Figures (6)

  • Figure 1: Overview of our method. This figure shows our video generation method in Test 1. The Wav2Lip model generates a lip-synced video whose mouth movements correspond to the audio of the target video. The FOMM model then uses the lip-synced video as the source video, together with an example frame from the target video, for facial puppeting, generating an output video with the same person's appearance but with behavior signatures transferred from the source video. In Test 2, we removed the lower part related to FOMM and instead used different videos of the same person to study the effect of the utterance. In Test 3, the source video was replaced with videos of human actors saying the same utterance, to study the effect of behavior style.
  • Figure 2: User statistics of Test 1. (a) and (b) show the percentage of users who preferred the video driven by the original actor's talking video (green) or by other actors' talking videos (blue). (c) shows the individual ratings, for each driving source, of how closely the person in the video resembles the original actor. Error bars indicate standard deviation. The number on each bar shows the average rating score. The darker blue bar shows the rating of all synthetic videos. Statistical significance in paired t-tests is also annotated. ($^{**}p < 0.01, ^{***}p < 0.001$)
  • Figure 3: User statistics of Test 2. (a) and (b) show the percentage of users who preferred the original Trump talking video (green) or the lip-synced video with a different utterance (blue). (c) shows the individual ratings, for each video, of how closely the person resembles Trump. Error bars indicate standard deviation. The number on each bar shows the average rating score. The darker blue bar shows the rating of all synthetic videos. Statistical significance in paired t-tests is also annotated. ($^{***}p < 0.001$)
  • Figure 4: Frame examples from Test 2. All videos play the same audio and have mouth movements matching that audio. The leftmost video is the original video after the lip-sync process has been applied. The three other videos on the right are taken from other videos of the same actor, and thus have head and facial behaviors that belong to a different utterance.
  • Figure 5: User statistics of Test 3. (a) and (b) show the percentage of users who preferred videos driven by the original Trump talking video (green) or videos from other lip-syncers (blue). (c) shows the individual ratings of how closely the person in the video resembles Trump. Error bars indicate standard deviation. The number on each bar shows the average rating score. The darker blue bar shows the rating of all synthetic videos. Statistical significance in paired t-tests is also annotated. ($^{***}p < 0.001$)
  • ...and 1 more figure
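
The figure captions report three kinds of summary statistics: the percentage of users preferring one video, mean ratings with standard deviations, and paired t-tests. The following is a minimal sketch of how such quantities could be computed, not the authors' analysis code. The function names `paired_t` and `preference_share` are illustrative, and every number in the example is invented for demonstration; none of it is data from the study.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for a paired t-test.

    a and b hold ratings from the same raters, in the same order.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def preference_share(choices, label):
    """Percentage of forced-choice answers equal to `label`."""
    return 100.0 * sum(c == label for c in choices) / len(choices)


# Hypothetical data: six raters, each rating a real and a synthetic video.
real_ratings = [5, 4, 5, 4, 5, 3]
synthetic_ratings = [3, 3, 4, 2, 4, 3]
choices = ["real", "real", "synthetic", "real", "real", "real"]

t_stat, dof = paired_t(real_ratings, synthetic_ratings)
print(f"mean real = {mean(real_ratings):.2f}, sd = {stdev(real_ratings):.2f}")
print(f"mean synthetic = {mean(synthetic_ratings):.2f}, sd = {stdev(synthetic_ratings):.2f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
print(f"preferred real: {preference_share(choices, 'real'):.1f}%")
```

The sketch stops at the t statistic because converting it to a p-value requires the Student t distribution; a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.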