Synthetic Video Enhances Physical Fidelity in Video Synthesis

Qi Zhao, Xingyu Ni, Ziyu Wang, Feng Cheng, Ziyan Yang, Lu Jiang, Bohan Wang

TL;DR

The paper addresses the gap between videos that are visually convincing but physically inconsistent and the need for physically faithful video synthesis. It proposes a data-centric approach that leverages CGI-generated videos from Blender and Unreal Engine, coupled with a diffusion-transformer model and the SimDrop technique to suppress synthetic artifacts while preserving physical realism. On three representative tasks (large human motion, wide-angle camera rotations, and layer decomposition), the method demonstrates improved 3D consistency, pose integrity, and foreground-background separation, outperforming several baselines and commercial models. This work highlights the practical potential of synthetic video data to enhance physical fidelity in video synthesis and sets the stage for richer supervisory signals and physics-aware training in the future.

Abstract

We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: https://kevinz8866.github.io/simulation/

Paper Structure

This paper contains 22 sections, 1 equation, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Our synthetic-data-enhanced video generation model is capable of producing videos depicting human dancing (row 1), scenes featuring a large camera orbit around the object (row 2), and animals against solid-color backgrounds for matting (row 3).
  • Figure 2: Visualization of the pipeline to augment a video generation model with synthetic video data. We first plan the synthetic videos and generate descriptive tags for each element (e.g., object, character, motion). Then we combine the element descriptions to form the caption for each synthetic video. During training, we mix the synthetic videos with real-world video data to improve physical fidelity in challenging video generation tasks.
  • Figure 3: Visualizations of synthetic videos highlighting both good- and poor-quality 3D assets (a) and rendering (b).
  • Figure 4: Visualizations of the videos generated by our improved model, trained using synthetic data. Rows 1 and 2 highlight wide-angle camera motion; row 3 displays layer decomposition; and rows 4, 5, and 6 demonstrate large human motion.
  • Figure 5: Visualization of video frames with large human motion generated by our model. The shadow of the human body follows the body's motion.
  • ...and 10 more figures