Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, Chris Callison-Burch

TL;DR

This work addresses the challenge of aligning video-generation evaluation with human perception by introducing DeeptraceReward, a fine-grained benchmark of human-perceived deepfake traces in AI-generated videos. It contains 4,334 trace annotations over 3,318 fake videos, alongside 3,318 real videos; the traces are categorized into 9 types, each with spatiotemporal localization and an explanation. Through evaluations of 13 multimodal LLMs and supervised fine-tuning, the authors demonstrate a large gap between binary fake/real detection and fine-grained trace grounding, and show that a dedicated 7B reward model trained on this dataset achieves significant gains (e.g., 70.2% overall, outperforming GPT-5 by 34.7% on average). The results highlight the importance of human-grounded signals for trustworthy video generation and provide a rigorous testbed for future models that better mimic human visual judgments.

Abstract

Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension -- whether humans can detect deepfake traces within a generated video, i.e., spatiotemporally grounded visual artifacts that reveal a video as machine generated -- has been largely overlooked. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
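
To make the annotation structure described in the abstract concrete, the following is a minimal Python sketch of what one trace record might look like, together with two overlap helpers of the kind commonly used to score spatial and temporal grounding. The class name, field names, coordinate convention, and the use of intersection-over-union are all assumptions made for illustration; this section does not specify the dataset's actual file format or the paper's evaluation metrics.

```python
from dataclasses import dataclass
from typing import Tuple

# (x_min, y_min, x_max, y_max); the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TraceAnnotation:
    """One human-perceived deepfake trace (hypothetical schema)."""
    video_id: str
    category: str      # one of the 9 trace categories
    explanation: str   # natural-language explanation of the trace
    bbox: Box          # region containing the perceived trace
    start_s: float     # onset timestamp, in seconds
    end_s: float       # offset timestamp, in seconds


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(start_a: float, end_a: float,
                 start_b: float, end_b: float) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0


# Illustrative values only; not taken from the dataset.
human = TraceAnnotation("vid_0001", "object merging",
                        "Two hands fuse into one.",
                        (0.0, 0.0, 2.0, 2.0), 0.0, 2.0)
predicted = TraceAnnotation("vid_0001", "object merging",
                            "The hands blend together.",
                            (1.0, 1.0, 3.0, 3.0), 1.0, 3.0)

spatial_score = box_iou(human.bbox, predicted.bbox)
temporal_score = temporal_iou(human.start_s, human.end_s,
                              predicted.start_s, predicted.end_s)
```

Keeping the spatial and temporal scores separate mirrors the abstract's observation that the two grounding tasks differ in difficulty, so a model's errors on each can be inspected independently.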

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Examples of human-perceived deepfake traces. The cases shown are selected from videos generated by Pika 1.5, MiniMax-Video-01, and Sora. For each deepfake trace, we annotate local bounding-box regions and start and end timestamps, and provide a natural language explanation. All fake trace categories are summarized in \ref{sec:data_analyses}, and their distribution can be found in \ref{fig:category_statistics}.
  • Figure 2: DeeptraceReward data curation pipeline. Selected videos are uploaded to our annotation platform, Labelbox, where experts provide fine-grained deepfake trace annotations with bounding boxes, textual explanations, and start / end timestamps.
  • Figure 3: Labelbox annotation interface. Each video is annotated with localized bounding boxes that highlight specific regions across frames where fakeness is perceived. Each annotated deepfake trace is accompanied by a natural language explanation and predefined category labels.
  • Figure 4: DeeptraceReward deepfake trace category statistics. Category definitions can be found in \ref{sec:exp_analyses}, and concrete examples for each category are listed in \ref{fig:example1} and \ref{fig:example2}.
  • Figure 5: Performance comparison between baseline models and our best reward model trained on the collected DeeptraceReward dataset. Our model performs much better in all categories, especially in "object splitting" and "object merging".
  • ...and 2 more figures