Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu, Xiufeng Song, Huayu Zheng, Lei Bai, Xiaoming Liu, Guangtao Zhai

TL;DR

MM-Det++ tackles diffusion-generated video detection by coupling a Spatio-Temporal FC-ViT branch with a Multimodal branch that leverages Multimodal Large Language Models for reasoning. A Unified Multimodal Learning module fuses these signals into a coherent representation, improving generalization across unseen diffusion methods. The Diffusion Video Forensics (DVF) dataset enables thorough evaluation in open-world settings and under post-processing attacks. Empirical results show state-of-the-art performance and robustness, highlighting the value of unified multimodal forgery learning for practical video forensics.

Abstract

The proliferation of videos generated by diffusion models has raised increasing concerns about information security, highlighting the urgent need for reliable detection of synthetic media. Existing methods primarily focus on image-level forgery detection, leaving generic video-level forgery detection largely underexplored. To advance video forensics, we propose a consolidated multimodal detection algorithm, named MM-Det++, specifically designed for detecting diffusion-generated videos. Our approach consists of two innovative branches and a Unified Multimodal Learning (UML) module. Specifically, the Spatio-Temporal (ST) branch employs a novel Frame-Centric Vision Transformer (FC-ViT) to aggregate spatio-temporal information for detecting diffusion-generated videos, where the FC-tokens capture holistic forgery traces from each video frame. In parallel, the Multimodal (MM) branch adopts a learnable reasoning paradigm to acquire a Multimodal Forgery Representation (MFR) by harnessing the comprehension and reasoning capabilities of Multimodal Large Language Models (MLLMs), discerning forgery traces from a flexible semantic perspective. To integrate multimodal representations into a coherent space, a UML module is introduced to consolidate the generalization ability of MM-Det++. In addition, we establish a large-scale and comprehensive Diffusion Video Forensics (DVF) dataset to advance research in video forgery detection. Extensive experiments demonstrate the superiority of MM-Det++ and highlight the effectiveness of unified multimodal forgery learning in detecting diffusion-generated videos.

Paper Structure

This paper contains 25 sections, 11 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Comparison between single-domain and our multimodal detection frameworks. Conventional single-domain detection primarily extracts spatio-temporal traces from videos. In contrast, we propose a multimodal detection framework that captures forgery traces across multiple modalities. By harnessing the perceptual and reasoning capabilities of a Multimodal Large Language Model (MLLM), our framework demonstrates enhanced effectiveness in detecting diffusion-generated videos.
  • Figure 2: (a) We adapt a Multimodal Large Language Model (MLLM) [liu2024visual] to the forgery detection task. Given the language instruction $\mathbf{Q}$ and image $\mathbf{I}$ as inputs, the MLLM generates a detailed and persuasive textual Answer Reasoning ($\mathbf{AR}$) to determine whether $\mathbf{I}$ is generated with AI techniques. (b) Comparison between the frameworks of MM-Det [song2024onlearning] and MM-Det++. While MM-Det incorporates the visual representation $\mathbf{H}_{v}$ and the reasoning representation $\mathbf{H}_{r}$ from the Large Language Model (LLM) through multi-round prediction for forgery detection, MM-Det++ advances the design in two aspects. First, MM-Det++ replaces the time-consuming multi-round prediction with a learnable reasoning process for efficient LLM reasoning. By introducing an additional learnable reasoning token $\mathbf{H}_{lr}$ into the instruction prompt, MM-Det++ extracts reasoning-relevant information in a single prediction, significantly improving efficiency. Second, MM-Det++ proposes a Unified Multimodal Learning (UML) module to aggregate cross-modality forgery information into a unified multimodal representation $\mathbf{H}_{um}$. (c) Compared with MM-Det, MM-Det++ exhibits improved efficiency and effectiveness during inference with the MLLM.
  • Figure 3: The overall structure of MM-Det++. It is a multimodal dual-branch detector that consists of a Spatio-Temporal (ST) branch, a Multimodal (MM) branch, and a Unified Multimodal Learning (UML) module to output the final prediction.
  • Figure 4: In FC-ViT, two self-attention operations are employed. Each input frame is first encoded and partitioned into Patch tokens (P-tokens) to aggregate patch-level temporal information, while additional Frame-Centric tokens (FC-tokens) are introduced to aggregate frame-level spatial information, producing the spatio-temporal forgery representation $\mathbf{H}_{ST}$.
  • Figure 5: The learnable reasoning process in the MM branch of MM-Det++. We introduce a Learnable Reasoning token (LR-token) to capture reasoning representations from the MLLM. Given a key frame $\mathbf{x}^{k}$, the embedding of textual tokens $\mathbf{H}_{t}$, the aligned visual tokens $\mathbf{\hat{H}}_{v}$, and the LR-token $\mathbf{H}_{lr}$ (fixed during inference) are concatenated and fed into a frozen LLM transformer. Finally, the context-aggregated LR-token $\mathbf{H}^{\prime}_{lr}$ is produced from $\mathcal{D}_{L}$ to facilitate reasoning over the diffusion-generated videos.
  • ...and 9 more figures
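
The learnable reasoning process described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frozen LLM is replaced by a single-head causal self-attention layer with random, fixed weights, and the function names (`frozen_causal_self_attention`, `lr_token_readout`), the token counts, the embedding width, and the placement of the LR-token at the end of the sequence are all assumptions made for illustration. The sketch only shows the data flow: concatenate $\mathbf{H}_{t}$, $\mathbf{\hat{H}}_{v}$, and $\mathbf{H}_{lr}$, run one forward pass, and read the output at the LR-token position as $\mathbf{H}^{\prime}_{lr}$.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frozen_causal_self_attention(seq, w_q, w_k, w_v):
    """Hypothetical stand-in for the frozen LLM: one single-head causal
    self-attention layer whose weights are never updated."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(seq.shape[-1])
    # Causal mask (assumed, decoder-style): position i attends to positions <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


def lr_token_readout(h_t, h_v_aligned, h_lr, weights):
    """Concatenate text, aligned visual, and LR tokens, run one forward pass,
    and return the output at the LR-token position(s)."""
    # Assumption: the LR-token comes last, so under causal attention it can
    # aggregate context from every text and visual token in a single pass.
    seq = np.concatenate([h_t, h_v_aligned, h_lr], axis=0)
    out = frozen_causal_self_attention(seq, *weights)
    return out[-h_lr.shape[0]:]


rng = np.random.default_rng(0)
d = 16                                  # toy embedding width (assumed)
h_t = rng.normal(size=(5, d))           # 5 text-token embeddings
h_v = rng.normal(size=(9, d))           # 9 aligned visual tokens of a key frame
h_lr = rng.normal(size=(1, d))          # one LR-token, fixed at inference
weights = tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

h_lr_out = lr_token_readout(h_t, h_v, h_lr, weights)
print(h_lr_out.shape)
```

In this toy setting the readout depends on the visual tokens, so a different key frame yields a different $\mathbf{H}^{\prime}_{lr}$, while only one forward pass is needed instead of multi-round text generation.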