Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection

Ying Zhang, Yuezun Li, Bo Peng, Jiaran Zhou, Huiyu Zhou, Junyu Dong

TL;DR

This work tackles the problem of detecting pixel-level inpainting in videos, a forensic priority as generative methods become more realistic. It introduces MumPy, a Multilateral Temporal-view Pyramid Transformer that flexibly fuses spatial and temporal clues through three components: a Multilateral Temporal-view Encoder, a Deformable Window-based Temporal-view Interaction (DWTI) module, and a Multi-pyramid Decoder. A new large-scale dataset, YTVI, based on YouTube-VOS, is provided to enable robust cross-domain evaluation across multiple modern inpainting methods. On the DVI, FVI, and YTVI benchmarks, MumPy achieves state-of-the-art $mIoU$ and $F1$ scores, demonstrating strong cross-domain generalization and the value of adaptive spatial-temporal clue integration for detection.
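For readers unfamiliar with the two reported metrics, the sketch below shows how IoU and F1 are commonly computed for a predicted binary mask against a ground-truth mask, and then averaged over frames. It is a minimal illustration, not the paper's evaluation code: the function names `mask_scores` and `mean_scores`, the per-frame averaging, and the handling of empty masks are all assumptions made here.

```python
def mask_scores(pred, target):
    """Return (IoU, F1) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # pixel correctly flagged as inpainted
            elif p and not t:
                fp += 1          # pixel flagged but actually pristine
            elif t and not p:
                fn += 1          # inpainted pixel that was missed
    union = tp + fp + fn
    # Assumption: if both masks are empty, count the frame as a perfect match.
    if union == 0:
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


def mean_scores(preds, targets):
    """Average per-frame IoU and F1 over a sequence of frames (assumed protocol)."""
    scores = [mask_scores(p, t) for p, t in zip(preds, targets)]
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

In practice the predicted detection map would first be thresholded to a binary mask; the threshold and the exact averaging protocol used in the paper are not stated in this summary.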

Abstract

The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that lets spatial-temporal clues collaborate flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale Video Inpainting dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.

Paper Structure

This paper contains 10 sections, 4 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Results of our method compared with the others in cross-domain scenarios. The top examples show an obvious temporal relationship while the bottom ones exhibit a strong spatial relationship. These examples demonstrate the significance of flexible collaboration of spatial-temporal clues.
  • Figure 2: Overview of the proposed Multilateral Temporal-view Pyramid Transformer. See text for details.
  • Figure 3: (a) Diagram and (b) process of DWTI.
  • Figure 3: Cross-dataset performance of different methods from DVI to FVI dataset (DVI $\rightarrow$ FVI).
  • Figure 4: Temporal-view Feature Fusing (TFF) block.
  • ...and 3 more figures