Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang, Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu

TL;DR

This study assesses whether state-of-the-art multi-modal reasoning LLMs can detect deepfakes in static images without task-specific fine-tuning. Using a zero-shot prompt pipeline, the authors benchmark 12 LLMs across diverse face-swap datasets (CDF, FF+, RWDF) and compare against traditional vision models, analyzing both detection performance via ROC-AUC and model reasoning behavior. Key findings show that OpenAI models (e.g., GPT-4o) achieve competitive zero-shot detection and demonstrate some generalization to out-of-distribution data, while many non-OpenAI models underperform, with newer reasoning-enabled variants not consistently improving results. The work highlights the potential and limitations of integrating multi-modal reasoning into deepfake detection, emphasizing the need for robustness, interpretability, and careful consideration of model biases and failure modes in real-world deployment.

Abstract

Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, including OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, Llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 Sonnet. We benchmark 12 recent multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive zero-shot performance with promising generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the remaining LLM families perform disappointingly, some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not contribute to performance on a niche task such as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.
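Since detection performance is reported as ROC-AUC, and some models are described as worse than random guessing, a small sketch of how that metric behaves may help. The code below is an illustration only, not the authors' evaluation code: the function name `roc_auc`, the labels, and the scores are all hypothetical, and how a "fake" score is obtained from each LLM is not specified in this summary. It uses the pairwise definition of ROC-AUC: the probability that a randomly chosen fake image receives a higher score than a randomly chosen real one.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of fake vs. real scores.

    labels: 1 for fake, 0 for real (hypothetical convention).
    scores: higher means "more likely fake". Ties count as half a win.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one fake and one real example")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Hypothetical data: three fake images followed by three real images.
labels = [1, 1, 1, 0, 0, 0]
useful_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]    # mostly ranks fakes higher
inverted_scores = [0.1, 0.2, 0.6, 0.5, 0.8, 0.9]  # mostly ranks fakes lower

print(roc_auc(labels, useful_scores))    # above 0.5: better than chance
print(roc_auc(labels, inverted_scores))  # below 0.5: worse than chance
```

An AUC of 0.5 corresponds to random guessing, so a value below 0.5 means the model's scores are systematically ranked in the wrong direction.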

Paper Structure

This paper contains 25 sections, 23 figures, 1 table.

Figures (23)

  • Figure 1: Overall experiment design
  • Figure 2: Version ROC Curve
  • Figure 3: Size difference ROC Curve
  • Figure 4: Reasoning ROC Curve
  • Figure 5: ROC Curve for all models
  • ...and 18 more figures