A Timely Survey on Vision Transformer for Deepfake Detection

Zhikan Wang, Zhongyao Cheng, Jiajie Xiong, Xun Xu, Tianrui Li, Bharadwaj Veeravalli, Xulei Yang

TL;DR

This paper provides a timely literature review of Vision Transformer (ViT)-based deepfake detection as of February 2024, classifying models into standalone, sequential hybrid, and parallel hybrid architectures and outlining 14 ViT-based detectors. It details each model's design principles, datasets, and performance characteristics, and presents benchmark results on FF++ Mesoscopic90 and Celeb-DF to assess cross-dataset robustness. The survey highlights open challenges such as model drift, data scarcity and quality, temporal consistency, and bias, and proposes future research directions including explainability, generalization, multi-modal fusion, and standardized benchmarking. By compiling architectures, empirical insights, and accessible code references, the paper aims to provide a practical and theoretical foundation for researchers and practitioners advancing ViT-based deepfake detection.

Abstract

In recent years, the rapid advancement of deepfake technology has revolutionized content creation, lowering forgery costs while elevating quality. However, this progress brings forth pressing concerns such as infringements on individual rights, national security threats, and risks to public safety. To counter these challenges, various detection methodologies have emerged, with Vision Transformer (ViT)-based approaches showcasing superior performance in generality and efficiency. This survey presents a timely overview of ViT-based deepfake detection models, categorized into standalone, sequential, and parallel architectures. Furthermore, it succinctly delineates the structure and characteristics of each model. By analyzing existing research and addressing future directions, this survey aims to equip researchers with a nuanced understanding of ViT's pivotal role in deepfake detection, serving as a valuable reference for both academic and practical pursuits in this domain.

Paper Structure (24 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: An overview of the main models discussed in this survey.
  • Figure 2: Overview architecture of ICT. (a) Training phase and (b) Testing phase. (The figure is taken from ICT)
  • Figure 3: Overview architecture of UIA-ViT. (The figure is taken from UIA-ViT)
  • Figure 4: Overview architecture of CVIT. (The figure is taken from CVIT)
  • Figure 5: Overview architecture of Khan's model. (The figure is taken from Increment)
  • ...and 2 more figures