RealMaster: Lifting Rendered Scenes into Photorealistic Video

Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.
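
The anchor-based propagation described above can be pictured as a simple input layout: the enhanced first and last frames are fixed, and every intermediate frame is left for the propagation model to fill in. The sketch below is illustrative only. The function name `build_propagation_inputs`, the zero placeholders, and the boolean mask convention are assumptions made for this example and are not taken from the paper or from any propagation model's API.

```python
import numpy as np

def build_propagation_inputs(rendered, enhanced_first, enhanced_last):
    """Assemble a hypothetical input layout for anchor-based propagation.

    rendered:        (T, H, W, 3) rendered frames from the 3D engine
    enhanced_first:  (H, W, 3) realism-enhanced version of frame 0
    enhanced_last:   (H, W, 3) realism-enhanced version of frame T-1
    """
    num_frames = rendered.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames to place both anchors")

    # Reference video: anchors hold enhanced pixels, the rest are placeholders.
    reference = np.zeros_like(rendered)
    reference[0] = enhanced_first
    reference[-1] = enhanced_last

    # True marks the intermediate frames that still have to be synthesized.
    to_generate = np.ones(num_frames, dtype=bool)
    to_generate[[0, num_frames - 1]] = False
    return reference, to_generate

# Toy usage with random data standing in for a rendered clip.
rendered = np.random.rand(8, 4, 4, 3).astype(np.float32)
reference, to_generate = build_propagation_inputs(
    rendered, rendered[0] * 0.5, rendered[-1] * 0.5
)
```

In this layout only the two anchors carry appearance information, which is why the intermediate frames need separate geometric cues to stay aligned with the rendered input.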

Paper Structure (34 sections, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Overview of RealMaster. Our method consists of two stages: (1) Synthetic-to-Realistic Data Generation: Given a synthetic video, we edit sparse keyframes and propagate their appearance across the sequence using VACE, conditioned on edge maps from the input video, to create paired synthetic–realistic training data. (2) Model Training: We fine-tune an IC-LoRA over a text-to-video diffusion model on the paired data, enabling direct sim-to-real video translation at inference time.
  • Figure 2: Qualitative Results. We show representative GTA-V video sequences together with their edits produced by our method. These translated sequences demonstrate our method’s ability to produce photorealistic video while maintaining strict temporal coherence. Note the consistent appearance of materials, lighting, and fine details across frames. Best viewed zoomed in.
  • Figure 3: Qualitative comparison with baseline methods. We compare our method against Runway-Aleph, LucyEdit, and Editto on three videos from the benchmark. The baselines either alter the original scene content, leading to identity drift and color shifts, or fail to produce sufficiently photorealistic results. In contrast, our method preserves scene structure and identity while improving the photorealism.
  • Figure 4: User study. We report the percentage of trials where participants preferred RealMaster over each baseline for realism, faithfulness to the original video, and overall visual quality.
  • Figure 5: Data generation ablation. We ablate sparse-to-dense propagation for training pair generation, comparing multiple-anchor editing, depth conditioning, and edge conditioning for VACE. Multiple-anchor editing leads to temporal flickering and fluctuations in identity. Depth conditioning loses facial expression and facial structure, often failing to preserve identity. In contrast, edge conditioning preserves facial details more reliably and produces the most stable results across the sequence.
  • ...and 8 more figures
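
The captions for Figures 1 and 5 refer to edge maps computed from the input video and used as conditioning. This excerpt does not say which edge detector is used, so the sketch below substitutes a plain Sobel gradient magnitude as a stand-in. The helper names `sobel_edge_map` and `video_edge_maps` are hypothetical.

```python
import numpy as np

def sobel_edge_map(frame):
    """Return an (H, W) edge map in [0, 1] for an (H, W, 3) frame.

    Sobel is an assumed stand-in; the excerpt does not name the edge detector.
    """
    gray = frame.mean(axis=-1)
    padded = np.pad(gray, 1, mode="edge")
    # Horizontal and vertical Sobel responses built from shifted slices.
    gx = (padded[:-2, 2:] + 2 * padded[1:-1, 2:] + padded[2:, 2:]
          - padded[:-2, :-2] - 2 * padded[1:-1, :-2] - padded[2:, :-2])
    gy = (padded[2:, :-2] + 2 * padded[2:, 1:-1] + padded[2:, 2:]
          - padded[:-2, :-2] - 2 * padded[:-2, 1:-1] - padded[:-2, 2:])
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude

def video_edge_maps(frames):
    """Apply the per-frame edge detector to a (T, H, W, 3) clip."""
    return np.stack([sobel_edge_map(f) for f in frames])

# Toy clip: dark left half, bright right half, so the only edge is vertical.
frames = np.zeros((2, 6, 6, 3), dtype=np.float32)
frames[:, :, 3:] = 1.0
edges = video_edge_maps(frames)
```

Edge maps of this kind keep contours such as facial outlines while discarding color and shading, which fits the ablation's observation that edge conditioning preserves facial details more reliably than depth conditioning.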