
Video Inpainting Localization with Contrastive Learning

Zijie Lou, Gang Cao, Man Lin

TL;DR

The work tackles blind localization of inpainted regions in forged videos and introduces ViLocal, which combines a 3D Uniformer encoder operating on HP3D noise residuals with supervised contrastive learning to learn discriminative forensic representations. A two-stage training scheme first optimizes the encoder with the contrastive loss $L_{\mathrm{Contra.}}$ and then trains a lightweight decoder with focal loss to generate pixel-wise localization maps, aided by a 2500-video VOS dataset with per-frame pixel-level annotations. Experiments show ViLocal achieves state-of-the-art localization accuracy, robustness to compression, and strong generalization across unseen inpainting algorithms, underscoring its practical value for video forensics. The work provides a public dataset and a practical framework for forensic localization that can be extended to broader forgery types.
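The stage-2 focal loss mentioned above can be sketched as a standard binary focal loss over the decoder's pixel-wise logits. This is a minimal illustration, not the paper's implementation; the `gamma` and `alpha` values are the conventional defaults and may differ from the authors' settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss for pixel-wise inpainting localization maps.

    logits:  (N, H, W) raw decoder outputs.
    targets: (N, H, W) binary masks, 1 = inpainted pixel, 0 = pristine.
    gamma/alpha: conventional defaults (illustrative, not from the paper).
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    # Probability assigned to the true class of each pixel.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Class-balancing weight.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified pixels.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

The `(1 - p_t)^gamma` factor is what makes focal loss suited to localization: most pixels are easy pristine background, and the modulating term keeps them from dominating the gradient.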

Abstract

Deep video inpainting is often used as a malicious manipulation to remove important objects and create fake videos, so blindly identifying the inpainted regions is significant. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual for learning effective spatiotemporal forensic features. To enhance the discriminative power, supervised contrastive learning is adopted to capture the local inconsistency of inpainted videos by attracting positive pairs and repelling negative pairs of pristine and forged pixels. A pixel-wise inpainting localization map is yielded by a lightweight convolution decoder with a specialized two-stage training strategy. To prepare enough training samples, we build a video object segmentation dataset of 2500 videos with pixel-level annotations per frame. Extensive experimental results validate the superiority of ViLocal over state-of-the-art methods. Code and dataset will be available at https://github.com/multimediaFor/ViLocal.
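The attract/repel objective described in the abstract can be sketched as a supervised contrastive (InfoNCE-style) loss over pixel embeddings sampled from the encoder: pixels with the same pristine/forged label form positive pairs, pixels with different labels form negatives. A minimal sketch, assuming L2-normalized embeddings and an illustrative temperature; function and variable names are not from the paper:

```python
import torch
import torch.nn.functional as F

def pixel_supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over sampled pixel embeddings.

    embeddings: (N, D) pixel features sampled from the encoder.
    labels:     (N,) 0 = pristine pixel, 1 = inpainted pixel.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # (N, N) similarities
    n = z.size(0)
    # Exclude each pixel's similarity to itself.
    logits_mask = ~torch.eye(n, dtype=torch.bool, device=z.device)
    # Positives: other pixels sharing the same label (pristine-pristine
    # or forged-forged); everything else acts as a negative.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & logits_mask
    sim = sim.masked_fill(~logits_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(~logits_mask, 0.0)  # avoid 0 * (-inf)
    # Average log-probability of positives, per anchor with >= 1 positive.
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

Minimizing this loss pulls same-class pixel embeddings together and pushes pristine and forged embeddings apart, which is the local-inconsistency signal the decoder later converts into a localization map.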

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of inpainting artifacts extracted by HP3D. From top to bottom: original frames, inpainted frames and corresponding noise images, and ground truth.
  • Figure 2: Proposed video inpainting localization scheme ViLocal. Each set of 5 consecutive frames serves as an input unit to yield the inpainting localization map of the middle frame. (a) Training stage 1: ViLocal utilizes contrastive supervision to train the encoder network. (b) Training stage 2: ViLocal employs localization supervision to train the decoder network.
  • Figure 3: Illustration of our supervised contrastive learning for video inpainting localization.
  • Figure 4: IoU and F1 for different codecs and CRFs.
  • Figure 5: Qualitative visualization on two DVI2016 videos.
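The HP3D noise residual illustrated in Figure 1 is a high-pass 3D filtering of the video clip that suppresses content and exposes inpainting artifacts. A minimal sketch using a 3D Laplacian-style kernel as a stand-in; the actual HP3D filter coefficients are defined in the paper and may differ:

```python
import torch
import torch.nn.functional as F

def hp3d_residual(video):
    """High-pass 3D noise residual of a grayscale clip.

    video: (1, 1, T, H, W) tensor with values in [0, 1].
    Returns a tensor of the same shape (zero-padded at the borders).
    Kernel is a 3D Laplacian-style stand-in, not the paper's HP3D weights.
    """
    k = torch.zeros(1, 1, 3, 3, 3, dtype=video.dtype)
    k[0, 0, 1, 1, 1] = 6.0  # center tap
    # Subtract the 6 face-adjacent neighbors along t, y, x.
    for t, y, x in [(0, 1, 1), (2, 1, 1), (1, 0, 1),
                    (1, 2, 1), (1, 1, 0), (1, 1, 2)]:
        k[0, 0, t, y, x] = -1.0
    return F.conv3d(video, k, padding=1)
```

On smooth regions the residual is near zero, while inpainted regions leave statistical traces in the residual that the encoder can learn to detect, as Figure 1 illustrates.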