MumPy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection
Ying Zhang, Yuezun Li, Bo Peng, Jiaran Zhou, Huiyu Zhou, Junyu Dong
TL;DR
This work tackles the problem of detecting pixel-level inpainting in videos, a forensic priority as generative methods become more realistic. It introduces MumPy, a Multilateral Temporal-view Pyramid Transformer that flexibly fuses spatial and temporal cues through a Multilateral Temporal-view Encoder, a Deformable Window-based Temporal-view Interaction, and a Multi-pyramid Decoder. A new large-scale dataset, YTVI, based on YouTube-VOS, is provided to enable robust cross-domain evaluation across multiple modern inpainting methods. On DVI, FVI, and YTVI benchmarks, MumPy achieves state-of-the-art $mIoU$ and $F1$ scores, demonstrating strong cross-domain generalization and the value of adaptive spatial-temporal clue integration for detection.
Abstract
The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, which limits their applicability across different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that combines spatial-temporal clues flexibly. Our method uses a newly designed multilateral temporal-view encoder to extract various combinations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these combinations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new, challenging, large-scale video inpainting dataset (YTVI) based on the YouTube-VOS dataset, built with several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
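
Because the detection maps are pixel-level, results of this kind are usually scored by comparing a binarized predicted mask against a ground-truth mask. The following is a minimal sketch of IoU and F1 for binary masks, averaged over frames. It is not the authors' evaluation code: the function names, the per-frame averaging, and the convention of scoring 1.0 when both masks are empty are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# A 2D binary mask: 1 marks an inpainted pixel, 0 a pristine one.
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, gt: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of the inpainted (positive) class."""
    tp, fp, fn = confusion_counts(pred, gt)
    denom = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1(pred: Mask, gt: Mask) -> float:
    """F1 score of the inpainted (positive) class."""
    tp, fp, fn = confusion_counts(pred, gt)
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


def mean_over_frames(
    metric: Callable[[Mask, Mask], float],
    preds: Sequence[Mask],
    gts: Sequence[Mask],
) -> float:
    """Average a per-frame metric over a video (assumed averaging scheme)."""
    scores = [metric(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


# Toy two-frame "video": frame 0 is partly wrong, frame 1 is a perfect match.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
gts = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
miou = mean_over_frames(iou, preds, gts)
mf1 = mean_over_frames(f1, preds, gts)
```

In the toy example, frame 0 has one true positive, one false positive, and one false negative, so IoU is lower than F1 for the same mask; this is why the two metrics are typically reported together.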
