Identity-Aware Vision-Language Model for Explainable Face Forgery Detection

Junhao Xu, Jingjing Chen, Yang Jiao, Jiacheng Zhang, Zhiyu Tan, Hao Li, Yu-Gang Jiang

TL;DR

The paper tackles the challenge of robust, explainable face forgery detection in real-world settings where forgeries preserve plausible visuals but violate identity or contextual semantics. It introduces a personalized vision-language model built on a LLaVA backbone, which injects identity priors through specialized tokens and uses a lightweight Detection Adapter to preserve low-level artifacts. The approach demonstrates strong performance on the IDImage dataset, achieving 94.25% accuracy and 94.08% F1 with only 10 extra tokens, while providing interpretable explanations for forgery judgments. The findings suggest that combining high-level semantic reasoning with preserved low-level evidence and identity personalization yields superior generalization to unseen manipulations and practical, explainable forensic capabilities.

Abstract

Recent advances in generative artificial intelligence have enabled the creation of highly realistic image forgeries, raising significant concerns about digital media authenticity. While existing detection methods demonstrate promising results on benchmark datasets, they face critical limitations in real-world applications. First, existing detectors typically fail to detect semantic inconsistencies with the person's identity, such as implausible behaviors or incompatible environmental contexts in given images. Second, these methods rely heavily on low-level visual cues, making them effective for known forgeries but less reliable against new or unseen manipulation techniques. To address these challenges, we present a novel personalized vision-language model (VLM) that integrates low-level visual artifact analysis and high-level semantic inconsistency detection. Unlike previous VLM-based methods, our approach avoids resource-intensive supervised fine-tuning that often struggles to preserve distinct identity characteristics. Instead, we employ a lightweight method that dynamically encodes identity-specific information into specialized identifier tokens. This design enables the model to learn distinct identity characteristics while maintaining robust generalization capabilities. We further enhance detection capabilities through a lightweight detection adapter that extracts fine-grained information from shallow features of the vision encoder, preserving critical low-level evidence. Comprehensive experiments demonstrate that our approach achieves 94.25% accuracy and 94.08% F1 score, outperforming both traditional forgery detectors and general VLMs while requiring only 10 extra tokens.
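The two mechanisms named in the abstract, identifier tokens and a detection adapter over shallow vision features, can be pictured with a toy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the linear form of the adapter, the toy dimensions, and the order in which the pieces are concatenated are all hypothetical. Only the count of 10 extra tokens comes from the abstract.

```python
import random

HIDDEN = 8           # toy embedding width (assumption; real models are far wider)
SHALLOW_DIM = 4      # toy width of shallow vision-encoder features (assumption)
NUM_ID_TOKENS = 10   # the abstract reports "only 10 extra tokens"


def make_identity_tokens(num_tokens, dim, seed=0):
    """Hypothetical learnable identifier embeddings, one row per extra token."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def detection_adapter(shallow_feats, weight):
    """Toy linear projection of shallow features into the language-model space.

    `weight` holds one row per output dimension; the real adapter design is
    not specified in this summary, so a single linear map is an assumption.
    """
    return [[sum(x * w for x, w in zip(feat, row)) for row in weight]
            for feat in shallow_feats]


def build_input_sequence(id_tokens, deep_feats, adapted_feats, text_embeds):
    """Concatenate all embeddings into one sequence (ordering is an assumption)."""
    return id_tokens + deep_feats + adapted_feats + text_embeds


rng = random.Random(1)
shallow = [[rng.random() for _ in range(SHALLOW_DIM)] for _ in range(3)]
deep = [[rng.random() for _ in range(HIDDEN)] for _ in range(5)]
text = [[rng.random() for _ in range(HIDDEN)] for _ in range(6)]
weight = [[rng.gauss(0.0, 0.1) for _ in range(SHALLOW_DIM)] for _ in range(HIDDEN)]

id_tokens = make_identity_tokens(NUM_ID_TOKENS, HIDDEN)
sequence = build_input_sequence(id_tokens, deep, detection_adapter(shallow, weight), text)
print(len(sequence))  # 10 identity + 5 deep + 3 adapted + 6 text = 24
```

In a setup like this, only the identifier embeddings and the adapter weights would need training per identity, which is one plausible reading of why the abstract calls the method lightweight.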

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: While existing detectors and fine-tuned VLMs can handle known manipulation methods (Case 1), they usually struggle with novel forgery techniques (Case 2). Our approach leverages both low-level artifact detection and high-level semantic analysis through personalized identity priors, enabling robust detection with explanatory reasoning for both scenarios.
  • Figure 2: Overview of our proposed framework. Given a query image and a small set of authentic reference images, our approach (1) extracts personalized identity priors that encode both appearance and behavioral characteristics through specialized tokens (<id_a> and <id_b>), and (2) leverages a lightweight Detection Adapter that preserves low-level visual artifacts from shallow layers of the vision encoder. Through this multi-level feature integration, the model can identify both visual inconsistencies (e.g., unnatural blending boundaries) and semantic implausibilities (e.g., inconsistent clothing or behavior).
  • Figure 3: Data Curation and Annotation.
  • Figure 4: Qualitative comparison of forgery detection interpretability between our method and GPT-4o.