How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi
TL;DR
This work introduces Learned 3D Evaluation (L3DE), a data-driven framework that quantifies 3D visual coherence in AI-generated videos using monocular cues of appearance, motion, and geometry extracted from foundation models. A 3D CNN classifier trained with a contrastive objective distinguishes real from synthetic videos; Grad-CAM localization provides interpretable region-level explanations, and a fusion module enables holistic scoring. Validation against 3D reconstruction and human judgments shows strong alignment with both objective rendering quality and perceptual realism, and reveals persistent gaps in current generators, especially in motion and geometry. L3DE is also used to benchmark multiple models, to serve as a deepfake detector, and to support artifact-guided refinement, demonstrating its potential to guide future improvements in video synthesis toward more faithful 3D world simulation.
Abstract
Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Kling, Sora, and MiniMax) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies. Project page: https://justin-crchang.github.io/l3de-project-page/
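The gradient-based visualization mentioned above can be illustrated with a minimal Grad-CAM sketch. This is not the authors' implementation: the function name `grad_cam`, the flattened `(channels, positions)` layout, and the toy numbers are all assumptions made for illustration. In a real pipeline, the activations and gradients would come from an intermediate layer of the 3D convolutional network, with gradients taken with respect to the classifier's score for the class of interest (presumably "synthetic").

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap over flattened spatio-temporal positions.

    activations[c][i]: feature of channel c at position i
                       (i indexes the flattened T*H*W grid).
    gradients[c][i]:   d(score) / d(activations[c][i]) for the
                       class of interest.
    """
    num_positions = len(activations[0])
    # Channel weights: global average of the gradients per channel.
    weights = [sum(g) / len(g) for g in gradients]
    # Weighted sum over channels, followed by ReLU.
    cam = [
        max(0.0, sum(w * a[i] for w, a in zip(weights, activations)))
        for i in range(num_positions)
    ]
    peak = max(cam)
    # Scale to [0, 1]; an all-zero map is returned unchanged.
    return [v / peak for v in cam] if peak > 0 else cam


# Toy example (hypothetical numbers): 2 channels, 4 positions.
acts = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 3.0]]
grads = [[0.5, 0.5, 0.5, 0.5], [-1.0, -1.0, -1.0, -1.0]]
heatmap = grad_cam(acts, grads)
print(heatmap)  # [0.5, 0.0, 1.0, 0.0]
```

Positions with high heatmap values are those whose features most increase the chosen class score; reshaping the flat list back to `(T, H, W)` and upsampling to the video resolution would give a region-level overlay of the kind the abstract describes.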
