
Language-guided Hierarchical Fine-grained Image Forgery Detection and Localization

Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu

TL;DR

This work proposes a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++, which contains four components: multi-branch feature extractor, language-guided forgery localization enhancer, as well as classification and localization modules.

Abstract

Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL, denoted as HiFi-Net++. Specifically, HiFi-Net++ contains four components: a multi-branch feature extractor, a language-guided forgery localization enhancer, and classification and localization modules. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while the localization and classification modules segment pixel-level forgery regions and detect image-level forgery, respectively. In addition, the language-guided forgery localization enhancer (LFLE), which contains image and text encoders learned by contrastive language-image pre-training (CLIP), is used to further enrich the IFDL representation. LFLE takes specifically designed texts and the given image as multi-modal inputs and generates the visual embedding and manipulation score maps, which are used to further improve the manipulation localization performance of HiFi-Net++. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on $8$ different benchmarks for both IFDL and forgery attribute classification. Our source code and dataset are available.
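
The hierarchical dependency between levels can be illustrated with a small sketch. The abstract does not give the exact conditioning rule, so the code below assumes one simple choice: each child node's probability is weighted by its parent's probability and the result is renormalized. The function names, the parent mapping `parent_of`, and the toy logits are all hypothetical and are not taken from HiFi-Net++.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def condition_on_parent(child_logits, parent_probs, parent_of):
    """Weight each child's probability by its parent's probability, then renormalize.

    This is an assumed conditioning rule for illustration only.
    parent_of[i] is the index of the level-(l-1) node above child i.
    """
    child_probs = softmax(child_logits)
    weighted = [p * parent_probs[parent_of[i]] for i, p in enumerate(child_probs)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Level 1: 2-way classification (0 = fully synthesized, 1 = partially manipulated).
level1 = softmax([2.0, 0.0])

# Level 2: four nodes; the assignment of children to parents is hypothetical.
parent_of = [0, 0, 1, 1]
level2 = condition_on_parent([0.0, 0.0, 0.0, 0.0], level1, parent_of)
```

With uninformative level-2 logits, the level-2 distribution simply inherits the level-1 preference, which is the behaviour a coarse-to-fine dependency is meant to encourage.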

Paper Structure

This paper contains 18 sections, 9 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: (a) In this work, we study image forgery detection and localization (IFDL), regardless of forgery method domains. (b) The distribution of forgery regions depends on individual forgery methods. Each color represents one forgery category (x-axis). Each bubble represents one image forgery dataset. The y-axis denotes the average of the forgery area. The bubble's area is proportional to the variance of the forgery area.
  • Figure 2: (a) We represent the forgery attribute of each manipulated image with multiple labels at different levels. (b) For an input image, we encourage the algorithm to classify its fine-grained forgery attributes at different levels, e.g., a $2$-way classification (fully synthesized or partially manipulated) on level $1$. (c) We perform the fine-grained classification via the hierarchical nature of different forgery attributes, where each depth $l$ node's classification probability is conditioned on the classification probabilities of neighbor nodes at depth ($l-1$). [Key: Fu. Sy.: Fully Synthesized; Pa. Ma.: Partially Manipulated; Diff.: Diffusion model; Cond.: Conditional; Uncond.: Unconditional].
  • Figure 3: (a) The pre-trained CLIP has a powerful visual embedding that can localize and recognize objects of interest, described by the text query, regardless of the objects' spatial size. For example, the pre-trained CLIP can understand the existence of figurines and eyebrows, given the corresponding text query. These images and text queries are acquired via the public CLIP website https://rom1504.github.io/clip-retrieval/. (b) Upper: HiFi-Net contains three modules: a multi-branch feature extractor, a classification module, and a localization module. Bottom: We integrate the Language-guided Forgery Localization Enhancer (LFLE) into the existing HiFi-Net; the resulting model is denoted as HiFi-Net++. (c) HiFi-Net++ can localize manipulation accurately, even when the manipulation area is small.
  • Figure 4: Given the input image, we first leverage color and frequency blocks to extract features. The multi-branch feature extractor learns feature maps of different resolutions, and these feature maps are used for the fine-grained classification at different levels via the classification module, detailed in Sec. \ref{sec_cls}. The language-guided localization enhancer, containing the pre-trained CLIP image and text encoders (denoted as $\boldsymbol{\theta}_i$ and $\boldsymbol{\theta}_t$, respectively), takes the input image and pre-defined text input, and then produces the visual embedding and the manipulation score map. The entire process is detailed in Sec. \ref{sec_lang_guided_fe}. In the end, the localization module in Sec. \ref{sec_loc} jointly takes the HiFi-Net feature map, visual embedding, and manipulation score map to generate the binary mask $\hat{\mathbf{M}}$ that indicates the manipulation area.
  • Figure 5: (a) The two forgery attribute names on level 1 of the hierarchical fine-grained formulation (Fig. \ref{fig_taxonomy_benchmark}) are Fully Synthesized and Partially Manipulated. We combine these forgery attribute names with a template (e.g., "Image region is synthesized by {} method"), which is randomly chosen from the template gallery. Therefore, the level 1 text input has two sentences: "image is manipulated by fully-synthesized method" and "image is manipulated by partial-manipulated methods". Likewise, levels 2, 3, and 4 have $4$, $6$, and $13$ sentences as the text input, respectively. We apply the pre-trained CLIP text encoder (i.e., $\boldsymbol{\theta}_{t}$) to the text inputs at different levels and obtain text embeddings $\mathbf{T}_{1}$, $\mathbf{T}_{2}$, $\mathbf{T}_{3}$, and $\mathbf{T}_{4}$. (b) The pre-trained CLIP image encoder (i.e., $\boldsymbol{\theta}_{i}$) takes the input image and yields $\mathbf{F}_{con}$, which represents the visual content, and a feature map $\mathbf{Z}$ that maintains the ability to be aligned with the text embeddings. After that, a refinement module uses $\mathbf{F}_{con}$ to refine the text embeddings at different levels. These refined text embeddings ($\mathbf{T}^{\prime}_{b} \text{ with } b \in \{1 \ldots 4\}$), along with $\mathbf{Z}$, generate manipulation score maps $\mathbf{S}_{b} \text{ with } b \in \{1 \ldots 4\}$, which serve as auxiliary signals to help the localization. During training, only the refinement module is trainable, while the pre-trained CLIP image and text encoders are frozen.
  • ...and 9 more figures
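
The text-input construction and score-map idea in the Figure 5 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the first template comes from the caption, the second is inferred from the caption's example sentences, the CLIP encoders are replaced by hand-written toy vectors, and cosine similarity between each spatial feature and each text embedding is an assumed scoring rule. The names `build_prompts` and `score_map` are hypothetical.

```python
import math
import random

# Template gallery: the first entry is quoted in the caption, the second is inferred.
TEMPLATES = [
    "Image region is synthesized by {} method",
    "image is manipulated by {} method",
]
# Level-1 attribute names as they appear in the caption's example sentences.
LEVEL1_ATTRIBUTES = ["fully-synthesized", "partial-manipulated"]


def build_prompts(attributes, templates, rng):
    """Pick one template at random and fill it with every attribute name."""
    template = rng.choice(templates)
    return [template.format(name) for name in attributes]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def score_map(feature_map, text_embeddings):
    """Return one H x W map per text embedding (assumed cosine scoring).

    feature_map is an H x W x D nested list standing in for Z;
    text_embeddings is a list of D-dimensional vectors standing in for T'_b.
    """
    return [
        [[cosine(pixel, t) for pixel in row] for row in feature_map]
        for t in text_embeddings
    ]


rng = random.Random(0)
prompts = build_prompts(LEVEL1_ATTRIBUTES, TEMPLATES, rng)

# Toy stand-ins for encoder outputs (D = 2, H = W = 2).
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
feature_map = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [1.0, 1.0]],
]
maps = score_map(feature_map, text_embeddings)
```

In a real pipeline the toy vectors would be replaced by the frozen encoder outputs, and the refinement module described in the caption would adjust the text embeddings before scoring.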