
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang, Zinan Lin, Xuefei Ning, Jiwen Yu, Pengfei Wan, Yu Wang, Xihui Liu

TL;DR

CineScene tackles cinematic video generation with decoupled scene context by injecting implicit 3D scene representations into a pretrained text-to-video diffusion model. It leverages VGGT to extract 3D-aware features from a static scene and conditions generation with both scene context and camera trajectory, avoiding explicit geometry and enabling large view changes. A simple shuffled-context training strategy and a scene-decoupled Unreal Engine 5 dataset support robust learning, and experiments show state-of-the-art scene consistency and camera accuracy with good generalization. The approach promises practical benefits for virtual production and cinematic storytelling by enabling dynamic subjects within stable scene layouts under flexible camera motions.

Abstract

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages an implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model through additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects and panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.
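The random-shuffling strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that the input scene images are randomly shuffled during training; the function name `shuffle_scene_context`, the choice to reshuffle once per training sample, and the use of a seeded `random.Random` are assumptions made here for illustration, not details from the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_scene_context(scene_images: Sequence[T], rng: random.Random) -> List[T]:
    """Return the scene images in a random order without modifying the input.

    Hypothetical helper: the intent is that the model cannot rely on a fixed
    ordering of the context images, only on their content.
    """
    shuffled = list(scene_images)  # copy so the caller's sequence is untouched
    rng.shuffle(shuffled)          # in-place shuffle of the copy
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    views = ["view_0", "view_1", "view_2", "view_3"]  # placeholder image ids
    for step in range(3):
        # Assumption: a fresh permutation is drawn for every training sample.
        print(step, shuffle_scene_context(views, rng))
```

Each call yields a permutation of the same images, so the scene content seen by the model is unchanged while its order varies.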

Paper Structure

This paper contains 23 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Examples generated by CineScene. Given multiple images of a static environment, a prompt, and a user-defined camera trajectory, the model generates high-quality videos featuring a dynamic subject while preserving the underlying scene under large view changes.
  • Figure 2: Overview of the Scene-Decoupled Video Dataset. For each scene, we render (a) videos with and without a dynamic subject, (b) a 360° panoramic image representing the static scene from a common starting viewpoint, and (c) diverse camera trajectories.
  • Figure 3: Overview of CineScene. Left: Our method, CineScene, injects implicit 3D information as a context condition. Features from VGGT are encoded as tokens ($F_t$) and concatenated with the scene images ($I_t$) and the noisy video latents. This architecture fundamentally decouples the static background (the condition) from the dynamic foreground (the generation target). Right: In contrast, loss-guided approaches use the VGGT features to form a supervisory loss, which penalizes deviations from the static scene and thus discourages dynamic content generation. We omit the text prompt for simplicity.
  • Figure 4: Qualitative comparison of CineScene with previous context-based, explicit 3D guidance, and camera-controlled methods. For dynamic and static scenes, we compare with FramePack [zhang2025framepack], CaM [yu2025contextasmemory], and Gen3C [ren2025gen3c]; for camera control, we compare with Traj-Attn [xiao2025trajectory] and RecamMaster [bai2025recammaster]. We provide the scene ground truth (gt) for comparison. Only 4 scene context images are shown for illustration.
  • Figure 5: Qualitative ablation study on methods for injecting implicit 3D information. The loss-guided method shows artifacts when generating a dynamic subject.
  • ...and 3 more figures
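
The context concatenation described in the Figure 3 caption can be sketched roughly as follows. The caption says only that VGGT feature tokens ($F_t$), the scene images ($I_t$), and the noisy video latents are concatenated; the sketch below assumes, purely for illustration, that all three are already token lists of equal width and that they are joined along the token (sequence) dimension. The names `build_context_sequence` and `target` are hypothetical, and tokens are plain Python lists rather than tensors.

```python
from typing import List, Tuple

Token = List[float]


def build_context_sequence(
    feature_tokens: List[Token],  # stands in for F_t (VGGT feature tokens)
    image_tokens: List[Token],    # stands in for encoded scene images I_t
    video_tokens: List[Token],    # stands in for noisy video latent tokens
) -> Tuple[List[Token], slice]:
    """Concatenate condition tokens and video tokens into one sequence.

    Returns the joined sequence and the slice covering the video tokens,
    i.e. the only positions treated as the generation target (assumption).
    """
    groups = (feature_tokens, image_tokens, video_tokens)
    widths = {len(tok) for group in groups for tok in group}
    if len(widths) > 1:
        raise ValueError(f"all tokens must share one width, got {sorted(widths)}")

    sequence = feature_tokens + image_tokens + video_tokens
    start = len(feature_tokens) + len(image_tokens)
    return sequence, slice(start, len(sequence))


if __name__ == "__main__":
    f_t = [[0.1, 0.2], [0.3, 0.4]]     # 2 feature tokens
    i_t = [[1.0, 1.0]]                 # 1 scene-image token
    z_t = [[9.0, 9.0], [8.0, 8.0]]     # 2 noisy video tokens
    seq, target = build_context_sequence(f_t, i_t, z_t)
    print(len(seq), seq[target])
```

Keeping the scene tokens as a condition and reading back only the video slice mirrors the caption's point that the static background is the condition while the dynamic foreground is the generation target.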