DeepForgeSeal: Latent Space-Driven Semi-Fragile Watermarking for Deepfake Detection Using Multi-Agent Adversarial Reinforcement Learning

Tharindu Fernando, Clinton Fookes, Sridha Sridharan

TL;DR

DeepForgeSeal tackles proactive deepfake detection by embedding a learnable watermark in the latent, semantic space rather than in pixel space. The framework follows a Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm with three components: a watermarking agent that embeds a 512-bit message in a spherical latent space derived from CLIP features, an adaptive attacker that crafts a curriculum of benign and semantic edits, and an extractor that flags tampering when watermark recovery fails, as measured by the bit error rate (BER). Empirical results on CelebA and CelebA-HQ show superior visual fidelity (PSNR/SSIM) and deepfake-detection performance across diverse generation methods, with the watermark remaining robust to benign transforms and fragile to malicious edits. The work contributes a novel latent-space embedding strategy, an adversarial training loop with a learnable attack curriculum, and reward mechanisms that promote semantic drift toward known failure regions, offering a scalable, adaptive approach to real-world media authentication.
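To make the BER-based tamper check concrete, the following is a minimal sketch of how a bit error rate could be computed and thresholded. The function names and the 0.1 decision threshold are illustrative assumptions, not values taken from the paper; only the 512-bit message length comes from the summary above.

```python
import numpy as np

def bit_error_rate(embedded, extracted):
    """Fraction of bit positions where the extracted message differs."""
    embedded = np.asarray(embedded)
    extracted = np.asarray(extracted)
    if embedded.shape != extracted.shape:
        raise ValueError("messages must have the same length")
    return float(np.mean(embedded != extracted))

def is_tampered(embedded, extracted, threshold=0.1):
    """Flag an image when the BER exceeds a threshold (0.1 is an assumed value)."""
    return bit_error_rate(embedded, extracted) > threshold

rng = np.random.default_rng(0)
message = rng.integers(0, 2, size=512)  # 512-bit message, as in the summary

# Simulated benign edit: only a few bits are corrupted.
benign = message.copy()
benign[:8] ^= 1

# Simulated semantic edit: a large share of the bits is corrupted.
tampered = message.copy()
tampered[:200] ^= 1

print(bit_error_rate(message, benign), is_tampered(message, benign))
print(bit_error_rate(message, tampered), is_tampered(message, tampered))
```

A semi-fragile scheme aims for exactly this separation: a low BER after benign processing and a high BER after semantic manipulation, so a single threshold can distinguish the two.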

Abstract

Rapid advances in generative AI have led to increasingly realistic deepfakes, posing growing challenges for law enforcement and public trust. Existing passive deepfake detectors struggle to keep pace, largely due to their dependence on specific forgery artifacts, which limits their ability to generalize to new deepfake types. Proactive deepfake detection using watermarks has emerged to address the challenge of identifying high-quality synthetic media. However, these methods often struggle to balance robustness against benign distortions with sensitivity to malicious tampering. This paper introduces a novel deep learning framework that harnesses high-dimensional latent space representations and the Multi-Agent Adversarial Reinforcement Learning (MAARL) paradigm to develop a robust and adaptive watermarking approach. Specifically, we develop a learnable watermark embedder that operates in the latent space, capturing high-level image semantics, while offering precise control over message encoding and extraction. The MAARL paradigm empowers the learnable watermarking agent to pursue an optimal balance between robustness and fragility by interacting with a dynamic curriculum of benign and malicious image manipulations simulated by an adversarial attacker agent. Comprehensive evaluations on the CelebA and CelebA-HQ benchmarks reveal that our method consistently outperforms state-of-the-art approaches, achieving improvements of over 4.5% on CelebA and more than 5.3% on CelebA-HQ under challenging manipulation scenarios.

Paper Structure

This paper contains 29 sections, 16 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Watermarking and watermark retrieval performance of the proposed DeepForgeSeal framework. The visualisations include low-resolution and high-resolution bonafide images (row 1), numerous learned attacks against watermarking (row 2), and watermarked images that have undergone semantic-altering manipulations (row 3).
  • Figure 2: Method Overview: Given an input image $x$, the watermarking agent $\theta$ embeds a watermark $w_x$ into the spherical latent space $\mathbb{S}$ derived from semantic (CLIP Image) features $f(x)$, producing a watermarked image $x_w$. An attacker $\eta$ generates an adversarial image $x_a$ to disrupt watermark integrity. $\eta$ uses both benign edits (e.g., JPEG compression, cropping) and semantic-altering manipulations (e.g., face swaps) when generating its attack curriculum. The extractor $\delta$ attempts to recover $w_x$ from any image $x'$; failure to extract a valid watermark flags $x'$ as a potential tampered image, leveraging watermark consistency as a proxy for semantic authenticity.
  • Figure 3: Qualitative Results of Visual Quality: A comparison between the existing state-of-the-art models, FaceSigns [neekhara2024facesigns] and EditGuard [zhang2024editguard], and the proposed DeepForgeSeal method for watermarking two sample images from the CelebA dataset [liu2015deep].
  • Figure 4: Qualitative results of our DeepForgeSeal model when tested on videos generated by completely unseen image and video deepfake generation techniques (OpenAI SORA, Gemini Veo 3, StyleMask, and Hyper Reenact). We visualise sample images showing the detected face, along with the original (bonafide) watermarked image before the manipulation (shown at the bottom left). For additional visualisations, please refer to the supplementary material.
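The Figure 2 caption describes embedding a watermark in a spherical latent space and recovering it with an extractor. As a rough intuition only, the sketch below uses a simple spread-spectrum-style stand-in: a feature vector is L2-normalised onto the unit hypersphere, and each bit sets the sign of its projection onto a secret direction. This is not the paper's learned embedder or extractor. The carrier matrix, the `strength` parameter, the toy dimensions, and all function names are assumptions made for illustration, and the sketch stays in latent space rather than decoding back to an image.

```python
import numpy as np

def to_sphere(v):
    """Project a vector onto the unit hypersphere (L2 normalisation)."""
    return v / np.linalg.norm(v)

def make_carriers(dim, n_bits, rng):
    """Assumed secret key: n_bits orthonormal directions in feature space."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_bits)))
    return q.T  # shape (n_bits, dim), rows are orthonormal

def embed(feature, bits, carriers, strength=0.5):
    """Set the projection onto each carrier to +/- strength, then renormalise."""
    z = to_sphere(feature)
    signs = 2.0 * bits - 1.0            # map {0, 1} to {-1, +1}
    current = carriers @ z              # existing projections onto the carriers
    z_marked = z + carriers.T @ (strength * signs - current)
    return to_sphere(z_marked)          # positive rescaling keeps the signs

def extract(latent, carriers):
    """Read each bit back from the sign of its carrier projection."""
    return (carriers @ latent > 0).astype(int)

rng = np.random.default_rng(0)
dim, n_bits = 64, 16                    # toy sizes, far smaller than the paper's
feature = rng.standard_normal(dim)      # stand-in for a semantic feature f(x)
bits = rng.integers(0, 2, size=n_bits)
carriers = make_carriers(dim, n_bits, rng)

marked = embed(feature, bits, carriers)
recovered = extract(marked, carriers)
print(np.array_equal(recovered, bits))
```

In this toy version the robustness and fragility trade-off is governed by the single hand-set `strength` value, whereas the framework summarised here learns that balance through interaction with the attacker agent.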