Study of detecting behavioral signatures within DeepFake videos
Qiaomu Miao, Sinhwa Kang, Stacy Marsella, Steve DiPaola, Chao Wang, Ari Shapiro
TL;DR
This study investigates whether behavioral signatures in talking-face DeepFakes can betray authenticity despite highly realistic visuals. By combining lip-sync and facial reenactment (Wav2Lip and First Order Motion Model), the authors generate appearance-preserving DeepFakes driven by different sources and utterances, and gather human judgments on naturalness and engagement. Across three tests, humans preferred real videos over synthetic ones and identified specific nonverbal cues—particularly mouth movements and facial expressions—as influential, while head movements affected utterance-behavior congruence. The results suggest that behavioral signatures and their alignment with utterances are valuable cues for DeepFake detection, motivating detection models to incorporate dynamic behavior alongside visual quality to remain robust against future advances in video synthesis.
Abstract
There is strong interest in generating synthetic video imagery of people talking for various purposes, including entertainment, communication, training, and advertisement. With the development of deep fake generation models, synthetic video imagery will soon be visually indistinguishable to the naked eye from a naturally captured video. In addition, many methods continue to improve at evading more careful, forensic visual analysis. Some deep fake videos are produced through facial puppetry, which directly controls the head and face of the synthetic image through the movements of an actor, allowing the actor to 'puppet' the image of another person. In this paper, we address the question of whether one person's movements can be distinguished from those of the original speaker, by holding the visual appearance of the speaker fixed while transferring the behavior signals from another source. We conduct a study comparing synthetic imagery that 1) originates from a different person speaking a different utterance, 2) originates from the same person speaking a different utterance, and 3) originates from a different person speaking the same utterance. Our study shows that synthetic videos in all three cases are seen as less real and less engaging than the original source video. Our results indicate that there could be a behavioral signature that is detectable from a person's movements and separate from their visual appearance, and that this behavioral signature could be used to distinguish a deep fake from a properly captured video.
