Trusted Video Inpainting Localization via Deep Attentive Noise Learning

Zijie Lou, Gang Cao, Man Lin

TL;DR

Both quantitative and qualitative evaluations on various inpainted videos verify the remarkable robustness and generalization ability of the proposed TruVIL.

Abstract

Digital video inpainting techniques have improved substantially with deep learning in recent years. Although inpainting was originally designed to repair damaged areas, it can also be used maliciously to remove important objects and fabricate false scenes and facts. It is therefore important to identify inpainted regions blindly. In this paper, we present a Trusted Video Inpainting Localization network (TruVIL) with excellent robustness and generalization ability. Observing that high-frequency noise can effectively unveil the inpainted regions, we design deep attentive noise learning in multiple stages to capture the inpainting traces. First, a multi-scale noise extraction module based on 3D High Pass (HP3D) layers creates the noise modality from the input RGB frames. Then, the correlation between these two complementary modalities (RGB and noise) is explored by a cross-modality attentive fusion module to facilitate mutual feature learning. Finally, spatial details are selectively enhanced by an attentive noise decoding module to boost the localization performance of the network. To prepare enough training samples, we also build a frame-level video object segmentation dataset of 2500 videos with pixel-level annotation for all frames. Extensive experimental results validate the superiority of TruVIL over state-of-the-art methods. In particular, both quantitative and qualitative evaluations on various inpainted videos verify the remarkable robustness and generalization ability of the proposed TruVIL. Code and dataset will be available at https://github.com/multimediaFor/TruVIL.
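
The idea that a fixed high-pass filter turns RGB frames into a noise modality can be illustrated with a small sketch. This is not the HP3D layer itself: the abstract does not give the kernel values, and HP3D operates in 3D, whereas the example below filters each frame independently in 2D. The Laplacian-style kernel and the names `LAPLACIAN`, `high_pass` and `video_noise` are assumptions made for illustration only.

```python
# Illustrative only: a 3x3 Laplacian-style kernel whose weights sum to zero.
# The fixed kernels actually used by HP3D are not given in this section.
LAPLACIAN = [
    [0.0, -1.0, 0.0],
    [-1.0, 4.0, -1.0],
    [0.0, -1.0, 0.0],
]


def high_pass(frame, kernel=LAPLACIAN):
    """Filter one grayscale frame (a list of rows) with a fixed kernel.

    Borders are handled by replicating edge pixels, so a constant frame
    yields an all-zero residual.
    """
    h, w = len(frame), len(frame[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row to the frame
                    xx = min(max(x + dx, 0), w - 1)  # clamp column to the frame
                    acc += kernel[dy + k][dx + k] * frame[yy][xx]
            out[y][x] = acc
    return out


def video_noise(frames, kernel=LAPLACIAN):
    """Apply the same fixed (non-learned) filter to every frame of a clip."""
    return [high_pass(f, kernel) for f in frames]
```

Because the kernel weights sum to zero, smooth image content is suppressed and only local intensity changes survive, which is the general property that makes a noise view useful for exposing subtle traces.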

Paper Structure (24 sections, 10 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: Video inpainting localization. Given an inpainted video (second column), the inpainted regions are identified both spatially and temporally.
  • Figure 2: Overall architecture of the proposed video inpainting localization network TruVIL. Uniformer blocks are employed to capture the inpainting traces in both the RGB and noise streams, where the block numbers are $\textit{L}_i=\{5, 8, 20, 7\}$. A multilayer perceptron (MLP) layer followed by an attentive noise decoding module is used as the decoder to generate the final binary localization map. (Best viewed in color.)
  • Figure 3: Illustration of inpainting artifacts. From top to bottom: original frames, inpainted RGB frames and their corresponding noise images, and ground-truth inpainting masks. For different deep video inpainting algorithms, i.e., VI [kim2020vipami], OP [oh2019onion], CP [lee2019copy], E2FGVI [li2022towards], FuseFormer [liu2021fuseformer], STTN [zeng2020learning], FGVC [gao2020flow], FGT [zhang2022flow] and ISVI [zhang2022inertia], the artifacts incurred by inpainting are hardly observable in the RGB space but clearly visible in the noise domain.
  • Figure 4: Details of the 3D high pass filter (HP3D) layer. The HP-Filter consists of 3 convolution kernels with fixed parameters. $F_{in}$ and $F_{out}$ denote the input and output feature maps with the dimension $T \times H \times W \times C$, respectively.
  • Figure 5: Proposed cross modality attention module (Left) and its three constituent sub-modules (Right).
  • ...and 3 more figures