DREAM: A Benchmark Study for Deepfake REalism AssessMent
Bo Peng, Zichuan Wang, Sheng Yu, Xiaochuan Jin, Wei Wang, Jing Dong
TL;DR
The DREAM benchmark targets subjective visual realism assessment of deepfakes, addressing a gap beyond binary detection by modeling human perception of realism. It provides a large-scale dataset with MOS-style realism scores and textual artifact descriptions, along with a thorough evaluation of 16 realism methods, including a novel description-aligned DA-CLIP approach. The study demonstrates that fine-tuning and vision-language modeling yield strong performance, and pretraining on deepfake datasets significantly boosts accuracy, underscoring the link between detection and realism assessment. The inclusion of textual explanations and cross-modal analysis offers meaningful interpretability, making DREAM a foundational resource for future research in deepfake realism and multi-modal evaluation of generated visual content.
Abstract
Deep-learning-based face-swap videos, widely known as deepfakes, have drawn broad attention because of the threat they pose to information credibility. Recent work mainly focuses on deepfake detection, which aims to reliably and objectively tell deepfakes apart from real videos. The subjective perception of deepfakes, and especially its computational modeling and imitation, is also an important problem but has not been adequately studied. In this paper, we focus on the visual realism assessment of deepfakes, defined as the automatic assessment of deepfake visual realism in a way that approximates human perception. Such assessment is important for evaluating the quality and deceptiveness of deepfakes, which can in turn be used to predict their influence on the Internet, and it also has the potential to improve the deepfake generation process by serving as a critic. This paper promotes this new direction by presenting a comprehensive benchmark called DREAM, which stands for Deepfake REalism AssessMent. It comprises a deepfake video dataset of diverse quality, a large-scale annotation that includes 140,000 realism scores and textual descriptions obtained from 3,500 human annotators, and a comprehensive evaluation and analysis of 16 representative realism assessment methods, including recent methods based on large vision-language models and a newly proposed description-aligned CLIP method. The benchmark and the insights from this study can lay the foundation for future research in this direction and in related areas.
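To make the idea of "approximating human perception" concrete, the following is a minimal sketch of how per-annotator realism scores might be aggregated into a mean opinion score (MOS) per video and compared against a model's predictions using Spearman rank correlation. The aggregation rule, the choice of metric, the 1-to-5 rating scale, and all names and toy values here are illustrative assumptions, not the benchmark's actual protocol or data.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average the scores each video received (hypothetical aggregation rule)."""
    by_video = defaultdict(list)
    for video_id, score in ratings:
        by_video[video_id].append(score)
    return {vid: mean(scores) for vid, scores in by_video.items()}


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up for illustration): (video_id, realism score on a 1-5 scale).
ratings = [("v1", 5), ("v1", 4), ("v2", 2), ("v2", 1), ("v3", 3), ("v3", 4)]
mos = mean_opinion_scores(ratings)

# Hypothetical model outputs; only their ordering matters for rank correlation.
predicted = {"v1": 0.9, "v2": 0.2, "v3": 0.6}

ids = sorted(mos)
srcc = spearman([mos[v] for v in ids], [predicted[v] for v in ids])
print(mos, srcc)
```

A rank-based metric is used in this sketch because a realism predictor need not reproduce the human rating scale, only the ordering of videos from least to most realistic.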
