AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang

TL;DR

This work targets the detection of human-centric AI-generated videos by introducing AvatarShield, a multimodal LLM-based framework trained with Group Relative Policy Optimization to produce interpretable reasoning from simple binary labels. A dual-encoder vision module (semantic via discrete vision tower and residual via VQ-VAE) captures both high-level inconsistencies and fine-grained artifacts, guiding an LLM to output detection decisions and explanations. The FakeHumanVid benchmark, spanning 9 generation methods and 15k clips, enables comprehensive in-domain and cross-domain evaluation, where AvatarShield achieves state-of-the-art performance and strong generalization. Ablation studies demonstrate that each architectural component and the GRPO-based training contribute to robust temporal modeling and artifact detection, highlighting practical implications for media forensics and digital safety.

Abstract

Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
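
The idea of learning from binary labels with Group Relative Policy Optimization can be illustrated with a minimal sketch. It assumes the commonly described GRPO formulation, in which several responses are sampled for the same input and each reward is normalized by the mean and standard deviation of its group. The function names, the "real"/"fake" answer strings, and the single accuracy reward are illustrative assumptions, not the authors' implementation; the full method also uses temporal compensation, format, and length rewards, which are not modeled here.

```python
import statistics


def accuracy_reward(predicted: str, label: str) -> float:
    """Hypothetical binary reward: 1.0 if the predicted verdict matches the label."""
    return 1.0 if predicted.strip().lower() == label.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples in its group.

    Responses that score above the group mean get a positive advantage,
    responses below it get a negative one. `eps` avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One video labeled "fake", with four sampled verdicts from the policy (made-up values).
predictions = ["fake", "real", "fake", "fake"]
rewards = [accuracy_reward(p, "fake") for p in predictions]
advantages = group_relative_advantages(rewards)
```

In this sketch only the video-level label is needed: no textual rationale is supervised, and the relative ranking within each sampled group supplies the learning signal.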

Paper Structure

This paper contains 21 sections, 7 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: We focus on the problem of detecting human-centric generated videos, constructing a human-centric synthetic video detection dataset, FakeHumanVid, along with an efficient reasoning-style multimodal large language model, AvatarShield. Our FakeHumanVid dataset encompasses 9 different pose-driven, text-driven, and audio-driven video generation methods. The proposed AvatarShield significantly outperforms existing mainstream LLMs in terms of detection accuracy and reasoning capabilities.
  • Figure 2: Construction process and data distribution of our proposed FakeHumanVid.
  • Figure 3: Some sample videos in our FakeHumanVid dataset.
  • Figure 4: Illustration of the proposed AvatarShield. Our method takes text instructions as input through a text embedding layer and processes the video using a dual-encoder architecture, guiding the LLM to generate detection results along with reasoning outcomes. Then, under the GRPO framework, we jointly optimize the entire network through the detection accuracy reward, temporal compensation reward, format reward, and length reward, achieving precise and interpretable synthetic video detection.
  • Figure 5: Comparison results between our method and other LLM-based methods. While Qwen-SFT can only output binary real-or-fake judgments, and MM-Det can only provide fake information analysis for each frame, our method not only delivers accurate detection results but also provides a detailed and transparent reasoning process.
  • ...and 9 more figures