ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

Peijun Bao, Anwei Luo, Gang Pan, Alex C. Kot, Xudong Jiang

Abstract

Temporal forgery localization aims to identify the manipulated temporal segments of a video. Most existing benchmarks focus on appearance-level forgeries, such as face swapping and object removal. However, recent advances in video generation have driven the emergence of activity-level forgeries, which modify human actions to distort event semantics. The resulting forgeries are highly deceptive and critically undermine media authenticity and public trust. To address this threat, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in videos. It contains over 6K forged video segments that are seamlessly blended into the video context, yielding high visual consistency that makes them almost indistinguishable from authentic content to the human eye. We further propose Temporal Artifact Diffuser (TADiff), a simple yet effective baseline that exposes artifact cues through a diffusion-based feature regularizer. Based on ActivityForensics, we introduce comprehensive evaluation protocols covering intra-domain, cross-domain, and open-world settings, and benchmark a wide range of state-of-the-art forgery localizers to facilitate future research. The dataset and code are available at https://activityforensics.github.io.

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: a) Existing datasets for temporal forgery localization mainly focus on appearance-level forgeries such as object removal and face manipulation. b) Driven by recent advances in video generation and editing, however, activity-level forgeries have become increasingly prevalent and pose significant risks to media integrity and societal trust. c) To address this emerging threat, we present ActivityForensics, the first dataset for localizing manipulated activities in videos.
  • Figure 2: Overview of the grounding-assisted data generation pipeline. 1) We leverage video captioning and temporal grounding to obtain activity descriptions and localize their corresponding temporal segments. 2) The grounded segments and manipulated descriptions are then used as conditioning signals to perform activity manipulations automatically. 3) Finally, the manipulated segments are seamlessly merged into the rest of the video, so that tampered and authentic regions remain visually consistent. The green bounding boxes indicate the original regions, while the red ones correspond to the manipulated regions.
  • Figure 3: Statistics of the ActivityForensics dataset. a) Histogram of forgery-segment counts across manipulation methods, where vidu is used only for evaluation. b) Distribution of manipulated segment durations. c) Distribution of the ratio between manipulated segment duration and overall video duration.
  • Figure 4: Overview of the Temporal Artifact Diffuser (TADiff). Unlike action localization, which relies on high-level semantics for event understanding, manipulated activity localization requires sensitivity to subtle temporal and visual artifacts. To this end, TADiff injects stochastic perturbations into the temporal feature space of ActionFormer to suppress semantic bias, and then amplifies artifact cues via iterative denoising, which is composed of Feature-wise Linear Modulation (FiLM) and Denoising Diffusion Implicit Model (DDIM) updates.
  • Figure 5: Impact of the number of denoising steps.
  • ...and 3 more figures
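
The Figure 4 caption names three ingredients: a stochastic perturbation of temporal features, FiLM, and DDIM updates. The following is a minimal sketch of those generic building blocks only, not the authors' TADiff implementation. The function names, the cosine-style `alpha_bar` schedule, and the way the noise predictor is passed in as `eps_model` are all assumptions made for illustration; in a real system the noise predictor would be a learned, FiLM-conditioned network operating on ActionFormer features.

```python
import math
import random


def film(x, gamma, beta):
    """Feature-wise Linear Modulation: per-channel scale (gamma) and shift (beta)."""
    return [g * v + b for v, g, b in zip(x, gamma, beta)]


def alpha_bar(t, num_steps):
    """Assumed cumulative signal level: exactly 1 at t = 0, small but nonzero at t = num_steps."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2


def perturb(x0, noise, a_t):
    """Stochastic perturbation: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    return [math.sqrt(a_t) * v + math.sqrt(1.0 - a_t) * n for v, n in zip(x0, noise)]


def ddim_step(x_t, eps, a_t, a_prev):
    """Deterministic DDIM update (eta = 0) from signal level a_t to a_prev."""
    # Estimate the clean feature implied by the predicted noise.
    x0_pred = [(v - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for v, e in zip(x_t, eps)]
    # Re-noise the estimate to the less noisy level a_prev.
    return [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e for p, e in zip(x0_pred, eps)]


def denoise(x_t, eps_model, num_steps):
    """Run num_steps DDIM updates; eps_model(x_t, t) is a hypothetical noise predictor."""
    for t in range(num_steps, 0, -1):
        eps = eps_model(x_t, t)
        x_t = ddim_step(x_t, eps, alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps))
    return x_t


if __name__ == "__main__":
    random.seed(0)
    num_steps = 5
    feature = [random.gauss(0.0, 1.0) for _ in range(8)]  # one temporal feature vector
    noise = [random.gauss(0.0, 1.0) for _ in range(8)]
    noisy = perturb(feature, noise, alpha_bar(num_steps, num_steps))
    # With an oracle predictor that returns the true noise, DDIM inverts the perturbation.
    restored = denoise(noisy, lambda x, t: noise, num_steps)
    print(max(abs(a - b) for a, b in zip(feature, restored)))
```

The oracle predictor in the usage block exists only to check the arithmetic: when the predicted noise equals the injected noise, the deterministic DDIM updates return the original feature up to floating-point error. A trained predictor would only approximate this, and how TADiff turns the denoising trajectory into artifact cues is not specified in the caption.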