UVL2: A Unified Framework for Video Tampering Localization

Pengfei Pei

TL;DR

We propose an effective video tampering localization network that significantly improves the detection of video inpainting and splicing by extracting more generalized features of forgery traces.

Abstract

With the advancement of deep learning-driven video editing technology, security risks have emerged. Malicious video tampering can lead to public misunderstanding, property losses, and legal disputes. Current detection methods are mostly limited to specific datasets, perform poorly on unknown forgeries, and lack robustness to processed data. This paper proposes an effective video tampering localization network that significantly improves the detection performance of video inpainting and splicing by extracting more generalized features of forgery traces. Considering the inherent differences between tampered videos and original videos, such as edge artifacts, pixel distribution, texture features, and compression information, we design four modules to extract these features independently. To integrate these features, we employ a two-stage approach that uses both a Convolutional Neural Network and a Vision Transformer, which lets the network learn the features in a local-to-global manner. Experimental results demonstrate that the method significantly outperforms existing state-of-the-art methods and exhibits robustness.

Paper Structure (21 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The localization of tampered areas in video inpainting and video splicing.
  • Figure 2: Overview of the framework. We first extract inconsistencies between the original and tampered regions of the video from four aspects: texture, edges, pixels, and frequency domain. Then, we adopt a two-stage learning structure based on CNN and ViT to achieve correlation learning from local to global, which is used to fuse these inconsistent features. Finally, the output is a video consisting of pixel-level localization results of the tampered region. It's worth noting that within each stage, the Features Fusion module combines features and generates an additional feature branch for input to the next stage.
  • Figure 3: The Sobel operator, Laplacian operator and SRM operator are used in the spatial domain branch.
  • Figure 4: Ablation study on the proposed network on the DAVIS-VI dataset to assess the impact of different features and components.
  • Figure 5: Detection results on the DAVIS-VI and VS datasets. "Real" denotes the original video, "Fake" the tampered video, "Mask" the ground truth, and "Ours" the detection results of our method.
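
The Figure 3 caption names the Sobel, Laplacian, and SRM operators used in the spatial domain branch. As a rough illustration of what such fixed high-pass filters compute on a single grayscale frame, here is a minimal NumPy sketch. The kernel values are the commonly used textbook forms (the SRM kernel is the widely cited 5x5 "KV" residual filter) and are an assumption; the paper's exact kernels, normalization, and padding are not given in this section. The names `filter2d` and `spatial_residuals` are hypothetical helpers, not part of the authors' code.

```python
import numpy as np

# Assumed textbook kernels; the paper's exact filter values may differ.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)
SRM_KV = np.array([[-1, 2, -2, 2, -1],
                   [2, -6, 8, -6, 2],
                   [-2, 8, -12, 8, -2],
                   [2, -6, 8, -6, 2],
                   [-1, 2, -2, 2, -1]], dtype=np.float64) / 12.0


def filter2d(image, kernel):
    """Cross-correlate a 2-D image with an odd-sized kernel (edge-replicated padding)."""
    kh, kw = kernel.shape
    h, w = image.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    # Accumulate one shifted, weighted copy of the image per kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out


def spatial_residuals(frame):
    """Return edge, second-derivative, and noise-residual maps for one grayscale frame."""
    frame = np.asarray(frame, dtype=np.float64)
    gx = filter2d(frame, SOBEL_X)
    gy = filter2d(frame, SOBEL_Y)
    return {
        "sobel": np.hypot(gx, gy),            # gradient magnitude
        "laplacian": filter2d(frame, LAPLACIAN),
        "srm": filter2d(frame, SRM_KV),
    }
```

All three kernels sum to zero, so they suppress smooth image content and respond only to local variation. That is why filters of this kind are commonly used to expose boundary and noise inconsistencies between tampered and original regions.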