
LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection

Dat Nguyen, Enjie Ghorbel, Anis Kacem, Marcella Astrid, Djamila Aouada

Abstract

In this paper, we propose Localized Artifact Attention X (LAA-X), a novel deepfake detection framework that is both robust to high-quality forgeries and capable of generalizing to unseen manipulations. Existing approaches typically rely on binary classifiers coupled with implicit attention mechanisms, which often fail to generalize beyond known manipulations. In contrast, LAA-X introduces an explicit attention strategy based on a multi-task learning framework combined with blending-based data synthesis. Auxiliary tasks are designed to guide the model toward localized, artifact-prone (i.e., vulnerable) regions. The proposed framework is compatible with both CNN and transformer backbones, resulting in two different versions, namely, LAA-Net and LAA-Former, respectively. Despite being trained only on real and pseudo-fake samples, LAA-X competes with state-of-the-art methods across multiple benchmarks. Code and pre-trained weights for LAA-Net (https://github.com/10Ring/LAA-Net) and LAA-Former (https://github.com/10Ring/LAA-Former) are publicly available.
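The abstract mentions that LAA-X is trained only on real and pseudo-fake samples produced by blending-based data synthesis. As a minimal sketch of that idea (not the paper's exact pipeline; the alpha compositing, mask handling, and boundary definition here are illustrative assumptions), a pseudo-fake can be built by softly blending a source face region into a target image, with the blending boundary marking the artifact-prone, "vulnerable" band the auxiliary tasks attend to:

```python
import numpy as np

def make_pseudo_fake(target, source, mask, max_alpha=1.0):
    """Blend a source face region into a target image with a soft mask.

    target, source: float arrays in [0, 1] of shape (H, W, 3).
    mask: float array in [0, 1] of shape (H, W), 1 inside the face region.
    Returns the blended pseudo-fake and a boundary map highlighting where
    blending artifacts (the "vulnerable" regions) are expected to appear.
    NOTE: a simplified sketch, not the official LAA-X synthesis code.
    """
    alpha = mask[..., None] * max_alpha          # broadcast mask over RGB
    blended = alpha * source + (1.0 - alpha) * target
    # The soft-mask boundary (mask ~ 0.5) marks the artifact-prone band.
    boundary = 4.0 * mask * (1.0 - mask)         # peaks at mask = 0.5
    return blended, boundary
```

A hard mask (0/1) yields no boundary band, while a feathered mask concentrates the boundary map exactly on the blending seam, which is where such methods place their localized supervision.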

Paper Structure

This paper contains 35 sections, 15 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: Comparison of LAA-Net (•), LAA-Former (•), and LAA-Swin (•) with respect to existing methods, namely, Multi-attentional (•), SBI (•), Xception (•), RECCE (•), CADDM (•), FAViT (•), and ForensicsAdapter (•), using (a) the AUC performance with respect to different ranges of Mask-SSIM, and (b) its associated boxplots. *The results were obtained using the official source codes pretrained on FF++ and tested on Celeb-DFv2. Figure best viewed in color.
  • Figure 2: Overview of the proposed LAA-X framework. LAA-X is a multi-task learning framework that incorporates an explicit attention mechanism focusing on vulnerable regions through the integration of generic auxiliary tasks. This strategy enables LAA-X to adequately attend to fine-grained artifact-prone areas. Notably, these additional tasks can be removed at inference, reducing the computational cost at test time.
  • Figure 3: Overview of the proposed LAA-Net approach. It is composed of two main components, namely, (1) an explicit attention mechanism based on a multi-task learning framework with three branches, i.e., the binary classification branch, the heatmap branch, and the self-consistency branch, where the heatmap and self-consistency ground-truth data are generated from the detected vulnerable points, and (2) an Enhanced Feature Pyramid Network (E-FPN) that aggregates multi-scale features.
  • Figure 4: Extraction of the vulnerable points.
  • Figure 5: Architecture of the proposed Enhanced Feature Pyramid Network (E-FPN).
  • ...and 9 more figures
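Figures 2 and 3 describe a multi-task objective combining a binary classification branch with heatmap and self-consistency auxiliary branches. A minimal sketch of how such a combined training loss could be assembled is given below; the specific loss functions (sigmoid cross-entropy, pixel-wise MSE) and the weights `w_hm` / `w_cons` are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def laa_multitask_loss(cls_logit, cls_label,
                       heatmap_pred, heatmap_gt,
                       cons_pred, cons_gt,
                       w_hm=10.0, w_cons=1.0):
    """Combine the three LAA-X-style training objectives (illustrative sketch).

    cls_logit: scalar real/fake logit; cls_label: 0 (real) or 1 (fake).
    heatmap_*: (H, W) vulnerable-point heatmaps; cons_*: consistency maps.
    Weights w_hm and w_cons are placeholders, not values from the paper.
    """
    # Binary classification: sigmoid cross-entropy on the real/fake logit.
    p = 1.0 / (1.0 + np.exp(-cls_logit))
    l_cls = -(cls_label * np.log(p + 1e-8)
              + (1 - cls_label) * np.log(1.0 - p + 1e-8))
    # Heatmap branch: pixel-wise regression to the vulnerable-point heatmap.
    l_hm = np.mean((heatmap_pred - heatmap_gt) ** 2)
    # Self-consistency branch: agreement with the consistency ground truth.
    l_cons = np.mean((cons_pred - cons_gt) ** 2)
    return l_cls + w_hm * l_hm + w_cons * l_cons
```

Since only the classification branch is needed to produce a real/fake score, the two auxiliary terms can simply be dropped at inference, which is what allows the extra tasks to be removed at test time without additional cost.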

Theorems & Definitions (3)

  • Definition 1
  • Definition 1.1
  • Definition 1.2