Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement

Changsheng Chen, Yongyi Deng, Liangwei Lin, Zitong Yu, Zhimao Lai

TL;DR

This work proposes a DPAD method based on multi-modal disentangled traces (MMDT) that needs neither additional data collection nor knowledge of acquisition-device parameters. It explicitly employs the disentangled recaptured traces as new modalities in the transformer backbone, using adaptive multi-modal adapters to fuse RGB and trace features efficiently.

Abstract

Document Presentation Attack Detection (DPAD) is an important measure for protecting the authenticity of a document image. However, recent DPAD methods demand additional resources, such as manual effort to collect additional data or knowledge of the parameters of the acquisition devices. This work proposes a DPAD method based on multi-modal disentangled traces (MMDT) that avoids these drawbacks. We first disentangle the recaptured traces with a self-supervised disentanglement and synthesis network to improve generalization across document images with different contents and layouts. Then, unlike existing DPAD approaches that rely only on data in the RGB domain, we explicitly employ the disentangled recaptured traces as new modalities in the transformer backbone, using adaptive multi-modal adapters to fuse RGB and trace features efficiently. Visualization of the disentangled traces confirms the effectiveness of the proposed method across different document contents. Extensive experiments on three benchmark datasets demonstrate the superiority of our MMDT method in representing forensic traces of recapturing distortion.
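
The fusion idea in the abstract (RGB features combined with the disentangled traces through an adapter) can be illustrated with a minimal sketch. This is not the authors' adaptive multi-modal adapter: the class `BottleneckAdapter`, the bottleneck design, the zero-initialised up-projection, and all dimensions are assumptions chosen for illustration, and the random vectors stand in for backbone features of the RGB image and of the two traces $\mathbf{C}$ and $\mathbf{T}$.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class BottleneckAdapter:
    """Hypothetical adapter that fuses RGB and trace features (illustrative only)."""

    def __init__(self, dim, bottleneck, n_modalities, rng):
        # One down-projection per modality (RGB, content trace, texture trace).
        self.down = [rand_matrix(bottleneck, dim, rng) for _ in range(n_modalities)]
        # Assumption: zero-initialised up-projection, so the adapter starts
        # as an identity mapping on the RGB stream.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, tokens):
        # tokens[0] is the RGB feature; the rest are trace features.
        hidden = [0.0] * len(self.down[0])
        for W, x in zip(self.down, tokens):
            hidden = [h + z for h, z in zip(hidden, matvec(W, x))]
        hidden = [max(0.0, h) for h in hidden]  # ReLU
        delta = matvec(self.up, hidden)
        # Residual connection: fused feature = RGB feature + adapter output.
        return [r + d for r, d in zip(tokens[0], delta)]

rng = random.Random(0)
dim, bottleneck = 8, 2
rgb = [rng.random() for _ in range(dim)]            # placeholder RGB feature
content_trace = [rng.random() for _ in range(dim)]  # placeholder for C
texture_trace = [rng.random() for _ in range(dim)]  # placeholder for T

adapter = BottleneckAdapter(dim, bottleneck, n_modalities=3, rng=rng)
fused = adapter([rgb, content_trace, texture_trace])
```

In a setup like the one the Figure 1 caption describes, only the adapter and the classification head would receive gradient updates while the ViT backbone stays frozen; the sketch omits training entirely.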

Paper Structure (17 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The proposed DPAD network with disentangled recaptured traces. (a) Our disentanglement network disentangles the blur content $\mathbf{C}$ and the texture $\mathbf{T}$ from an image. (b) Our synthesis network synthesizes a recaptured trace $\hat{\mathbf{G}}_{\mathcal{R}}$ by spatially transforming the disentangled $\mathbf{C}_{\mathcal{R}}$ and $\mathbf{T}_{\mathcal{R}}$ in a recaptured image to fit the content of a genuine image $\mathbf{I}_{\mathcal{G}}$. (c) The disentanglement network (same as (a)) disentangles the blur content $\mathbf{C}_{\mathcal{R}}^\prime$ and the texture $\mathbf{T}_{\mathcal{R}}^\prime$ from $\hat{\mathbf{I}}_{\mathcal{R}}$. (d) The multi-scale discriminators classify images as real ($\mathbf{I}_{\mathcal{G}};\mathbf{I}_{\mathcal{R}}$) or reconstructed ($\hat{\mathbf{I}}_{\mathcal{G}};\hat{\mathbf{I}}_{\mathcal{R}}$). (e) The proposed DPAD method with multi-modal disentangled traces (MMDT), which takes $\mathbf{I}_{\mathcal{G}}$ or $\mathbf{I}_{\mathcal{R}}$ and its disentangled traces $\mathbf{C},\mathbf{T}$ as the input. During training, we fine-tuned the AMA and the classification head while the parameters of the ViT backbone were frozen.
  • Figure 2: Examples of reconstructed recaptured images. (1) Input genuine images $\mathbf{I}_{\mathcal{G}}$ that provide image content. (2) Input recaptured images $\mathbf{I}_{\mathcal{R}}$ that provide recaptured traces through the disentanglement network. (3-5) Recaptured images reconstructed with components $\mathbf{C}_{\mathcal{R}}$, $\mathbf{T}_{\mathcal{R}}$ and forensics traces $G(\mathbf{I})$, respectively. (6) Ground-truth recaptured images with the same contents as $\mathbf{I}_{\mathcal{G}}$ and acquired by the same devices as $\mathbf{I}_{\mathcal{R}}$.
  • Figure 3: Comparison of recaptured traces disentangled by liu2020disentangling and our disentanglement network with self-supervision. (a) A genuine sample. (b-f) Recaptured samples.
  • Figure S1: Examples of document images in RSCID by chen2022domain and our RSCID (L).