3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism

Bhavik Chandna, Kelsey R. Allen

TL;DR

The results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations.

Abstract

AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process, requiring human annotation or bespoke evaluation datasets of restricted scope. Here we develop an automated evaluation framework for video realism that captures both semantics and coherent 3D structure and does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.

Paper Structure (19 sections, 1 equation, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Example 3D point tracks (projected into 2D) reconstructed by 3DSPA for an unrealistic generated video from VideoPhy-2 bansal2025videophy; video frames progress from left to right, top to bottom. The video depicts a man striking a wall with a hammer and clearly violates physical laws. 3DSPA assigns it low realism, as measured by Average Jaccard ($AJ$) (blue: high, red: low, white: intermediate). The video ranks among the lowest in both 3DSPA's scores and human realism ratings, highlighting strong alignment with human judgment.
  • Figure 2: 3DSPA architecture overview. The encoder integrates 3D trajectories, temporal embeddings, and DINOv2 oquab2023dinov2 semantic features into a compact latent representation using occlusion-aware attention and a Perceiver-style transformer architecture jaegle2021perceiver. The decoder conditions on query points to reconstruct full 3D trajectories with occlusion flags.
  • Figure 3: Qualitative examples of accurate reconstruction on realistic generated videos from EvalCrafter and VideoPhy-2. High-rated videos produce coherent and stable point tracks under 3DSPA (blue). The top and middle examples are from EvalCrafter liu2024evalcrafter, and the bottom example is from VideoPhy-2 bansal2025videophy.
  • Figure 4: TRAJAN vs. 3DSPA. Qualitative comparison on videos from the EvalCrafter dataset liu2024evalcrafter. Compared to TRAJAN allen2025direct, 3DSPA produces more coherent and temporally stable point tracks, aligning more closely with human judgments of motion quality. In the top example (dog walking; human rating: 4.5/5), 3DSPA accurately captures articulated leg motion in 3D, whereas TRAJAN produces noisy and inconsistent tracks. In the bottom example (human rating: 1.67/5), the phone gradually disappears; 3DSPA correctly identifies this semantic violation, while TRAJAN fails because the trajectories are smooth yet semantically implausible.
  • Figure 5: Performance comparison across models on the IntPhys2 benchmark for each of the easy, medium and hard categories.
  • ...and 4 more figures
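
The Figure 1 caption uses Average Jaccard ($AJ$) as the realism signal for reconstructed tracks. As a rough illustration of what such a score measures, here is a minimal sketch of the 2D Average Jaccard as it is commonly defined for point tracking (in the style of the TAP-Vid benchmark). The function name `average_jaccard`, the flat list-of-samples input layout, and the pixel thresholds (1, 2, 4, 8, 16) are assumptions made for this example; this section does not state which thresholds or which 3D variant 3DSPA actually uses.

```python
import math


def average_jaccard(ref_xy, ref_visible, pred_xy, pred_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over distance thresholds (illustrative sketch).

    Each argument is a flat list with one entry per (point, frame) sample:
    ref_xy / pred_xy hold (x, y) positions in pixels, and
    ref_visible / pred_visible hold booleans (True = not occluded).
    The thresholds are an assumption, not taken from the paper.
    """
    scores = []
    for thr in thresholds:
        true_pos = 0   # visible in both and within the threshold
        false_pos = 0  # predicted visible, but occluded or too far away
        ref_pos = 0    # every sample that is visible in the reference
        samples = zip(ref_xy, ref_visible, pred_xy, pred_visible)
        for (rx, ry), r_vis, (px, py), p_vis in samples:
            within = math.hypot(px - rx, py - ry) < thr
            if r_vis:
                ref_pos += 1
            if r_vis and p_vis and within:
                true_pos += 1
            if p_vis and (not r_vis or not within):
                false_pos += 1
        denom = ref_pos + false_pos
        scores.append(true_pos / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: four samples of reference tracks vs. reconstructed tracks.
    ref_xy = [(0, 0), (10, 10), (20, 20), (30, 30)]
    ref_vis = [True, True, True, False]
    pred_xy = [(0, 0), (10, 13), (20, 20), (30, 30)]
    pred_vis = [True, True, False, True]
    print(average_jaccard(ref_xy, ref_vis, pred_xy, pred_vis))
```

A score near 1 means the reconstructed tracks agree with the reference tracks in both position and visibility; position errors and occlusion mistakes both pull it toward 0, which is consistent with the blue (high) to red (low) coloring described in the captions.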