MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic Evidence

Tai D. Nguyen, Matthew C. Stamm

TL;DR

MVFNet addresses the challenge of detecting and localizing diverse video forgeries without prior knowledge of the manipulation type. It combines spatial forensic residuals, RGB context, temporal forensic residuals, and optical-flow residuals, and processes them with a Multi-Scale Hierarchical Transformer to capture inconsistencies across scales and modalities. The approach introduces new modalities and a targeted pretraining loss, achieving state-of-the-art performance in multi-manipulation scenarios and competitive results against specialized detectors in single-manipulation tasks. The Unified Video Forgery Analysis dataset, along with UVFA-IND, UVFA-OOD, and VideoSham, provides a rigorous benchmark for evaluating generalization to unseen forgeries, highlighting MVFNet’s robustness and practical impact for broad video authentication needs.

Abstract

While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g. deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet - a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Examples of videos falsified using several different manipulations, alongside multiple forms of forensic evidence gathered by our network, and forgery localization masks produced by our network.
  • Figure 2: Overview of MVFNet. Our network extracts different types (modalities) of forensic evidence: spatial forensic residuals, RGB context, temporal forensic residuals, and optical flow residuals. Next, our network jointly analyzes all evidence using a multi-scale hierarchical transformer module. Finally, dedicated subnetworks produce final detection scores and localization masks.
  • Figure 3: Overview of our multi-scale hierarchical transformer module. The module is built from a series of adaptive pooling operations, resolution-aware connectors, and transformers operating at multiple scales. It is designed so that information flows in a coarse-to-fine manner.
  • Figure 4: Localization results from our proposed network as well as VideoFACT, VIDNet, DVIL, MVSS-Net, ManTra-Net, and FSG on 4 different manipulation types in the UVFA-IND dataset. We note that we do not provide localization results for deepfake detectors because these algorithms only perform detection.
  • Figure 5: Effect of video compression on detection and localization performance in the UVFA-IND dataset.
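
The coarse-to-fine flow described in the Figure 3 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`adaptive_avg_pool`, `upsample_nearest`, `coarse_to_fine`), the square scales, and the simple averaging used to merge levels are all assumptions made for illustration. In particular, the averaging step is only a stand-in for the resolution-aware connectors and per-scale transformers, whose details are not given here.

```python
from typing import List

Grid = List[List[float]]


def adaptive_avg_pool(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Average-pool a 2D grid down to (out_h, out_w) with adaptive bin edges."""
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h                      # floor of bin start
        r1 = ((i + 1) * in_h + out_h - 1) // out_h    # ceil of bin end
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour upsampling of a 2D grid to (out_h, out_w)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def coarse_to_fine(grid: Grid, scales: List[int]) -> List[Grid]:
    """Process `grid` at each scale, coarsest first, passing context downward.

    ASSUMPTION: averaging the upsampled coarse context with the current level
    is a placeholder for the connector + transformer at that scale.
    """
    context = None
    outputs = []
    for s in scales:
        level = adaptive_avg_pool(grid, s, s)
        if context is not None:
            up = upsample_nearest(context, s, s)
            level = [
                [0.5 * (a + b) for a, b in zip(row_a, row_b)]
                for row_a, row_b in zip(level, up)
            ]
        context = level
        outputs.append(level)
    return outputs


# Hypothetical 4x4 "evidence map" with a suspicious 2x2 patch in the top-left.
frame = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
pyramid = coarse_to_fine(frame, scales=[1, 2, 4])
```

In this toy example, the coarsest level summarizes the whole map, and each finer level blends its own pooled values with the context inherited from the level above, so a small anomalous region stays visible at every scale.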