Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

Yifan Wang, Liya Ji, Zhanghan Ke, Harry Yang, Ser-Nam Lim, Qifeng Chen

TL;DR

This work tackles the domain gap between synthetic driving videos and real-world footage by proposing a zero-shot, structure-aware denoising framework that enhances photorealism while preserving source content. It builds on a pre-trained diffusion video model (Cosmos-transfer) and uses DDIM inversion plus multi-modal conditioning (depth, semantic, edge maps) through a ControlNet to guide denoising under a realism-promoting prompt. The main contributions are: (i) a zero-shot inversion-generation pipeline that anchors to the original video, (ii) a structure-aware denoising strategy that maintains semantic identity of small objects such as traffic lights and road signs, and (iii) a rigorous evaluation protocol for object-level consistency, LPIPS, and video quality in synthetic-to-real enhancement. The results show improved structural consistency and competitive photorealism compared with baselines on CARLA-based sequences, enabling more realistic synthetic data without task-specific training. This has practical impact for data augmentation and safety-critical scenario coverage in autonomous driving research.

Abstract

We propose an approach to enhancing synthetic video realism that re-renders synthetic videos from a simulator in a photorealistic fashion. Our approach is a zero-shot framework, built upon a diffusion video foundation model without further fine-tuning, that focuses on preserving the multi-level structures of the synthetic video in the enhanced one across both the spatial and temporal domains. Specifically, we modify the generation/denoising process so that it is conditioned on structure-aware information, such as depth maps, semantic maps, and edge maps, which an auxiliary model estimates from the synthetic video rather than extracting it from the simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. The approach is simple yet general: in our experiments, it outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality.
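
The inversion-then-denoising idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `toy_eps_fn` is a hypothetical stand-in for the conditioned video denoiser, `edge_map` is a plain gradient magnitude rather than an estimate from an auxiliary model, and the noise schedule in `make_alpha_bars` is an arbitrary assumption. The sketch only shows that deterministic DDIM inversion (eta = 0) followed by denoising under the same structure condition returns the source video, which is what anchors the output to the original content.

```python
import numpy as np

def make_alpha_bars(num_steps):
    """Assumed linear-beta schedule; returns cumulative signal coefficients."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)

def edge_map(video):
    """Toy structure condition: per-frame gradient magnitude of a (T, H, W) video."""
    gy, gx = np.gradient(video, axis=(1, 2))
    return np.sqrt(gx ** 2 + gy ** 2)

def toy_eps_fn(x, t, cond):
    """Hypothetical noise predictor that depends only on the structure condition."""
    return 0.5 * cond

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_fn, cond, alpha_bars):
    """Walk a clean sample toward noise, one schedule step at a time."""
    x, a_prev = x0.copy(), 1.0
    for t, a_t in enumerate(alpha_bars):
        x = ddim_move(x, eps_fn(x, t, cond), a_prev, a_t)
        a_prev = a_t
    return x

def ddim_denoise(x_T, eps_fn, cond, alpha_bars):
    """Walk an inverted latent back to a clean sample under a condition."""
    x = x_T.copy()
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_move(x, eps_fn(x, t, cond), alpha_bars[t], a_prev)
    return x

rng = np.random.default_rng(0)
video = rng.random((4, 8, 8))          # 4 toy frames of 8x8 pixels
cond = edge_map(video)                 # structure condition from the video itself
alpha_bars = make_alpha_bars(50)

latent = ddim_invert(video, toy_eps_fn, cond, alpha_bars)
recon = ddim_denoise(latent, toy_eps_fn, cond, alpha_bars)
print("max round-trip error:", np.abs(recon - video).max())
```

The round trip is exact here only because the toy predictor ignores `x` and `t`; with a real network the inversion is approximate, so the reconstruction would be close to, not identical to, the source.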

Paper Structure

This paper contains 38 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Videos enhanced from rendered videos by our structure-aware denoising method. Our approach generates videos with structural consistency and state-of-the-art photorealism quality, especially for small objects in autonomous driving scenarios, such as roadside signals and traffic lights.
  • Figure 2: Overview of our pipeline for synthetic video realism enhancement. We inject DDIM inversion into the ControlNet version of the world model [agarwal2025cosmos] in a zero-shot, structure-aware denoising manner, aiming to improve structural consistency while maintaining photorealism.
  • Figure 3: Qualitative results of our method on synthetic video realism enhancement. Our method maintains photorealism and improves structural consistency across diverse outdoor conditions.
  • Figure 4: Examples of our method on temporal alignment. Our method keeps changes of traffic lights temporally aligned while also improving the photorealism of the enhanced videos, such as the shadows of cars. Best viewed zoomed in.
  • Figure 5: More examples of our method on synthetic videos from GTA [richter2016playing]. We show enhanced frames sampled every 20 frames. Our method maintains photorealism and improves structural consistency across diverse outdoor conditions.