How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi

TL;DR

This work introduces Learned 3D Evaluation (L3DE), a data-driven framework that quantifies 3D visual coherence in AI-generated videos by leveraging monocular cues of appearance, motion, and geometry extracted from foundation models. A 3D CNN classifier trained with a contrastive objective distinguishes real from synthetic videos, with Grad-CAM localization providing interpretable region-level explanations and a fusion module enabling holistic scoring. Validation against 3D reconstruction and human judgments shows strong alignment with both objective rendering quality and perceptual realism, and reveals persistent gaps in current generators, especially in motion and geometry. L3DE is then used to benchmark multiple models; it can also serve as a deepfake detector and as a tool for artifact-guided refinement, showing potential to guide future improvements in video synthesis toward more faithful 3D world simulation.

Abstract

Recent advancements in video diffusion models enable the generation of photorealistic videos with impressive 3D consistency and temporal coherence. However, the extent to which these AI-generated videos simulate the 3D visual world remains underexplored. In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos' ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. Instead of relying on 3D reconstruction, which is prone to failure with in-the-wild videos, L3DE employs a 3D convolutional network, trained on monocular 3D cues of motion, depth, and appearance, to distinguish real from synthetic videos. Confidence scores from L3DE quantify the gap between real and synthetic videos in terms of 3D visual coherence, while a gradient-based visualization pinpoints unrealistic regions, improving interpretability. We validate L3DE through extensive experiments, demonstrating strong alignment with 3D reconstruction quality and human judgments. Our evaluations on leading generative models (e.g., Kling, Sora, and MiniMax) reveal persistent simulation gaps and subtle inconsistencies. Beyond generative video assessment, L3DE extends to broader applications: benchmarking video generation models, serving as a deepfake detector, and enhancing video synthesis by inpainting flagged inconsistencies. Project page: https://justin-crchang.github.io/l3de-project-page/
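
The gradient-based visualization mentioned in the abstract is identified in the TL;DR as Grad-CAM. As a rough illustration of how such a heatmap is formed, the sketch below applies the standard Grad-CAM rule (channel weights from averaged gradients, a weighted sum of feature maps, then ReLU) to toy data. It is a minimal sketch under assumptions, not the authors' implementation: the function name `grad_cam`, the flattening of (time, height, width) positions into a single index, the final max-normalization, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List


def grad_cam(activations: List[List[float]],
             gradients: List[List[float]]) -> List[float]:
    """Standard Grad-CAM rule on flattened positions (illustrative only).

    activations[k][p]: value of feature map k at flattened position p.
    gradients[k][p]:   derivative of the class score w.r.t. activations[k][p].
    Returns one heat value per position, scaled to [0, 1].
    """
    if len(activations) != len(gradients):
        raise ValueError("activations and gradients need the same channel count")
    n_pos = len(activations[0])

    # Channel weight = gradient averaged over every position in that channel.
    weights = [sum(g) / len(g) for g in gradients]

    # Weighted sum of feature maps, followed by ReLU to keep positive evidence.
    cam = [
        max(0.0, sum(w * a[p] for w, a in zip(weights, activations)))
        for p in range(n_pos)
    ]

    # Assumption: scale by the peak so the map lies in [0, 1].
    peak = max(cam)
    return [c / peak for c in cam] if peak > 0 else cam


# Toy input: 2 channels, 4 flattened positions (made-up values).
toy_activations = [[1.0, 2.0, 0.0, 1.0],
                   [0.0, 1.0, 3.0, 1.0]]
toy_gradients = [[0.2, 0.4, 0.6, 0.8],
                 [0.1, 0.1, 0.1, 0.1]]
heat = grad_cam(toy_activations, toy_gradients)
print(heat)
```

In a real pipeline the activations and gradients would come from a chosen convolutional layer of the trained network, and the heatmap would be reshaped back to (time, height, width) and upsampled to the frame size; which layer and class score are used here is not stated in this summary.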

Paper Structure

This paper contains 37 sections, 5 equations, 15 figures, and 10 tables.

Figures (15)

  • Figure 1: L3DE evaluates videos from any generative model based on 3D visual coherence, assessing appearance, motion, and geometry. Its scores align closely with human perception and can localize regions of 3D simulation failures, similar to human intuition. Examples highlight key failure cases: (1) incorrect occlusion between the basketball and hoop, disrupting geometric consistency, (2) abrupt texture transition in plant leaves, and (3) unnatural relative motion between the golf ball and the golf club, violating real-world motion dynamics.
  • Figure 2: Statistics of activation values and pixel-value errors, and the distribution of pixel counts, for each proxy.
  • Figure 3: Frames and reconstruction results of twin videos. Even though synthetic videos appear plausible, they do not achieve the same level of 3D scene reconstruction accuracy as real videos (see the shrunken Gaussians in the rightmost column). This discrepancy underscores a key limitation: current generative videos are not yet adept at faithfully simulating the world in terms of 3D visual coherence.
  • Figure 4: Illustration of 3D inconsistencies identified by L3DE. From left to right: (a) AI-generated video frame; (b) frame rendered from the 3D reconstruction, with the camera pose aligned to the original view; (c) pixel-level difference between (a) and (b); (d) Grad-CAM result from the L3DE network, which closely aligns with (c); (e) Blue solid line: large (normalized) activation values in (d) coincide with large mean pixel-value errors in (c). Green dashed line: areas with high (normalized) activation values cover only a small portion of the entire frame. L3DE identifies key artifacts in the three cases: (1) unnatural hand motion in the first case, reflected in a low motion score of 0.4642; (2) abrupt geometric deformation of the marked object in the second case, with a geometry score of 0.637; and (3) sudden texture changes in the chair and table in the third case, resulting in an appearance score of 0.2578.
  • Figure 5: The correlation between L3DE scores and human ratings. The X-axis represents the average human ratings and the Y-axis represents the L3DE scores.
  • ...and 10 more figures
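
Figure 5 relates L3DE scores to average human ratings. One common way to quantify such agreement is a rank correlation; the sketch below computes Spearman's coefficient in plain Python. This is a minimal sketch under assumptions: the summary above does not state which correlation statistic the paper reports, the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical, and the ratings and scores are made-up illustrative values, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: mean human rating and a model score for five videos.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [0.10, 0.35, 0.30, 0.70, 0.90]
rho = spearman(human_ratings, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based statistic only asks whether the two orderings agree, so it does not depend on the score scale; the same helper could in principle be applied to the activation and pixel-error statistics plotted in Figure 4(e).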