
ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training

Weihuang Liu, Xi Shen, Chi-Man Pun, Xiaodong Cun

TL;DR

ForgeryTTT is the first method to use test-time training (TTT) to localize manipulated regions in images. It achieves a 20.1% improvement in localization accuracy over other zero-shot methods and a 4.3% improvement over non-zero-shot techniques.

Abstract

Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address the problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during the training-time training using a large synthetic dataset. Precisely, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used for the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code and data will be released upon publication.
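
The per-sample adaptation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-parameter "model", the squared-error self-supervised loss, the helper names (`adapt_then_predict`, `toy_ssl_grad`, `toy_predict`), the step count, and the learning rate are all hypothetical stand-ins. Starting every test sample from a fresh copy of the trained weights is also an assumption, since the abstract does not state it.

```python
from typing import Callable, List, Sequence, Tuple

Params = List[float]


def adapt_then_predict(
    trained: Sequence[float],
    sample: Sequence[float],
    ssl_grad: Callable[[Params, Sequence[float]], Params],
    predict: Callable[[Params, Sequence[float]], float],
    steps: int = 5,
    lr: float = 0.1,
) -> Tuple[float, Params]:
    """Adapt a copy of the trained parameters to one test sample, then predict."""
    # Work on a copy so the trained weights stay untouched
    # (assumption: each test sample starts from the same trained model).
    params = list(trained)
    for _ in range(steps):
        # Gradient of a self-supervised loss that needs no ground-truth label.
        grads = ssl_grad(params, sample)
        params = [p - lr * g for p, g in zip(params, grads)]
    return predict(params, sample), params


# Toy stand-ins (hypothetical): the self-supervised loss is
# (w - mean(sample))^2, and the "prediction" is simply the adapted weight.
def toy_ssl_grad(params: Params, sample: Sequence[float]) -> Params:
    mean = sum(sample) / len(sample)
    return [2.0 * (params[0] - mean)]


def toy_predict(params: Params, sample: Sequence[float]) -> float:
    return params[0]


trained_weights = [0.0]
test_sample = [1.0, 1.0, 1.0]
prediction, adapted = adapt_then_predict(
    trained_weights, test_sample, toy_ssl_grad, toy_predict
)
```

In ForgeryTTT the adapted component is the shared image encoder and the self-supervised signal comes from the classification head. The toy functions above only mirror the control flow: adapt on one sample, then predict on that same sample.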

Paper Structure (20 sections, 8 equations, 10 figures, 8 tables)


Figures (10)

  • Figure 1: Comparison of the testing phase in previous methods and in ours. Previous methods apply the trained model directly for forgery localization, while we first adapt the model to each image and then localize the forgery region.
  • Figure 2: Examples of localization results from our method, ForgeryTTT, on testing images. Without ForgeryTTT, the model fails to accurately localize the forgery regions. Performing adaptation on each image (w/ ForgeryTTT) yields significantly better results.
  • Figure 3: The overview of the proposed ForgeryTTT. ForgeryTTT is a multi-task framework built upon the common encoder-decoder image manipulation localization network, which includes a shared image encoder, a localization head, and a classification head. It is first trained for image manipulation localization and image manipulation classification on large-scale datasets. Then, we employ a self-supervised loss based on the classification head to train the image encoder for each test image. Finally, the updated model is used to localize the forgery region.
  • Figure 4: The proposed self-supervised image manipulation classification algorithm. We first extract the image features using the image encoder. Then, we group foreground manipulated tokens and background authentic tokens via a given mask. Random token dropout is applied to both foreground and background tokens. Next, the foreground tokens, background tokens, and class tokens are concatenated as manipulated queries, and background tokens and class tokens are concatenated as authentic queries. Finally, the classification head is trained to distinguish these two kinds of queries.
  • Figure 5: The details of the proposed classification head. The classification head merges the multi-scale features into tokens and outputs the probability of whether the given tokens are manipulated.
  • ...and 5 more figures
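
The token grouping described in the Figure 4 caption can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the names `drop_tokens` and `build_queries`, the mask threshold of 0.5, the keep ratio of 0.5, and the list-of-floats token representation are all hypothetical choices made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple

Token = Sequence[float]


def drop_tokens(tokens: List[Token], keep_ratio: float, rng: random.Random) -> List[Token]:
    """Randomly keep a fraction of the tokens (at least one if any exist)."""
    if not tokens:
        return []
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]


def build_queries(
    tokens: Sequence[Token],
    mask: Sequence[float],
    cls_token: Token,
    keep_ratio: float = 0.5,   # hypothetical value
    threshold: float = 0.5,    # hypothetical value
    seed: int = 0,
) -> Tuple[List[Token], List[Token]]:
    """Split tokens by mask, apply dropout per group, and build both queries."""
    rng = random.Random(seed)
    # Group tokens into foreground (manipulated) and background (authentic).
    fg = [t for t, m in zip(tokens, mask) if m > threshold]
    bg = [t for t, m in zip(tokens, mask) if m <= threshold]
    # Random token dropout in each group.
    fg = drop_tokens(fg, keep_ratio, rng)
    bg = drop_tokens(bg, keep_ratio, rng)
    # Manipulated query: foreground + background + class token.
    manipulated = fg + bg + [cls_token]
    # Authentic query: background + class token only.
    authentic = bg + [cls_token]
    return manipulated, authentic


tokens = [[float(i), float(i)] for i in range(6)]
mask = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]   # per-token manipulation score
cls = [-1.0, -1.0]
manipulated_query, authentic_query = build_queries(tokens, mask, cls)
```

A classification head would then be trained to label `manipulated_query` as manipulated and `authentic_query` as authentic. At test time the mask would come from the localization head rather than from ground truth, as the abstract describes.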