AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection
Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang
TL;DR
This work targets the detection of human-centric AI-generated videos by introducing AvatarShield, a multimodal LLM-based framework trained with Group Relative Policy Optimization (GRPO) to produce interpretable reasoning from simple binary labels. A dual-encoder vision module pairs a discrete vision tower, which captures high-level semantic inconsistencies, with a VQ-VAE residual extractor, which captures fine-grained artifacts; together they guide an LLM to output detection decisions and explanations. The FakeHumanVid benchmark, spanning 9 generation methods and 15K clips, supports both in-domain and cross-domain evaluation, where AvatarShield achieves state-of-the-art performance and strong generalization. Ablation studies show that each architectural component and the GRPO-based training contribute to robust temporal modeling and artifact detection, with practical implications for media forensics and digital safety.
Abstract
Recent advances in Artificial Intelligence Generated Content (AIGC) have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging large language models (LLMs) for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches depend heavily on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization (GRPO), enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
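To illustrate how binary labels alone can provide a training signal under GRPO, the following minimal sketch computes group-relative advantages for several responses sampled for one video. It is an illustration under stated assumptions, not the training code of AvatarShield: the function names, the 0/1 correctness reward, the group of four responses, and the use of the population standard deviation are all hypothetical choices. Only the general GRPO idea of normalizing each reward against its own group's mean and standard deviation is taken as given.

```python
from statistics import mean, pstdev


def binary_reward(predicted_label: str, true_label: str) -> float:
    """Hypothetical reward: 1.0 if the predicted label matches the binary label, else 0.0."""
    return 1.0 if predicted_label == true_label else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampled group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four responses sampled for one synthetic ("fake") video.
predictions = ["fake", "real", "fake", "fake"]
rewards = [binary_reward(p, "fake") for p in predictions]
advantages = group_relative_advantages(rewards)

for pred, adv in zip(predictions, advantages):
    print(f"{pred}: advantage {adv:+.3f}")
```

In this sketch, responses that reach the correct verdict receive a positive advantage and the incorrect one a negative advantage, so the policy update would favor the reasoning paths that led to correct labels without requiring any annotated explanation text. When all responses in a group agree, the advantages collapse to zero and that group contributes no gradient signal.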
