CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control

Zhiyi Kuang; Chengan He; Egor Zakharov; Yuxuan Xue; Shunsuke Saito; Olivier Maury; Timur Bagautdinov; Youyi Zheng; Giljoo Nam

Abstract

We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling high-quality control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit achieves high-fidelity outputs on par with state-of-the-art methods in both novel view synthesis and relighting, without sacrificing visual quality in either task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.

Paper Structure (28 sections, 5 equations, 6 figures, 2 tables)

Introduction
Related Work
Novel View Synthesis from Sparse Inputs.
Image and Video Relighting.
Multimodal Diffusion Models.
Methodology
Model Design
Diffusion Backbone.
Latent Embedding with Environment Maps.
Multimodal Conditioning with Camera Poses.
Context-Guided Diffusion.
Data Curation
Camera Pose Normalization.
Training and Inference
Training.
...and 13 more sections

Figures (6)

Figure 1: CamLit, a unified video diffusion model with joint camera and lighting control. Given a single image, CamLit generates a high-fidelity novel view video, a paired relit video, and a paired albedo video under a user-defined camera trajectory and lighting conditions.
Figure 2: An illustration of the CamLit pipeline. At the core of our framework is a multi-modal video DiT. This model takes as input a single RGB image, a camera trajectory, and an environment map. From these inputs, the model simultaneously generates a spatially and temporally aligned triplet of videos: (i) an RGB novel-view sequence under the same illumination as the input image, (ii) the corresponding relit sequence (with full shading from the environment map), and (iii) an albedo sequence capturing the scene’s intrinsics without shading.
Figure 3: Video generation examples of CamLit. For each example, we visualize two camera trajectories, moving backward and turning right as indicated by the arrows, to reveal generated content in unseen regions. From left to right, we show the input image, a novel view frame under the original lighting, the corresponding albedo, and three relit novel view frames. The environment maps used for relighting are shown in the insets.
Figure 4: Qualitative comparison of novel view synthesis methods. For each input image, we apply two camera trajectories, moving backward (first row) and turning right (second row), as indicated by the arrows. Our model, which performs NVS and relighting jointly, achieves NVS quality on par with state-of-the-art methods specifically dedicated to NVS.
Figure 5: Qualitative comparison of relighting methods. Our approach produces albedo and relit videos with quality comparable to DiffusionRenderer, which is our theoretical performance upper bound. The environment maps used for relighting are shown in the middle insets.
...and 1 more figure
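
The two camera trajectories named in Figures 3 and 4 (moving backward and turning right) can be illustrated with a minimal sketch that builds a sequence of 4x4 camera-to-world poses relative to the input view. This is not the authors' parameterization: the function names, the frame count, the motion magnitudes, and the OpenCV-style axis convention (+x right, +y down, +z forward) are all assumptions made for illustration, and the paper's camera pose normalization is not reproduced here.

```python
import numpy as np


def backward_trajectory(num_frames: int, distance: float) -> np.ndarray:
    """Poses that slide the camera backward along its own viewing axis.

    Assumes +z is the forward direction, so moving backward means the
    camera center travels from z = 0 to z = -distance.
    Returns an array of shape (num_frames, 4, 4).
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    poses[:, 2, 3] = -np.linspace(0.0, distance, num_frames)
    return poses


def turn_right_trajectory(num_frames: int, max_yaw_deg: float) -> np.ndarray:
    """Poses that rotate the camera in place toward its right (+x) side.

    Assumes +x right, +y down, +z forward; a positive rotation about the
    y axis then swings the forward vector toward +x.
    Returns an array of shape (num_frames, 4, 4).
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    yaw = np.deg2rad(np.linspace(0.0, max_yaw_deg, num_frames))
    c, s = np.cos(yaw), np.sin(yaw)
    poses[:, 0, 0] = c
    poses[:, 0, 2] = s
    poses[:, 2, 0] = -s
    poses[:, 2, 2] = c
    return poses


# Hypothetical settings: 25 frames, 1 unit of backward motion, 30 degrees of yaw.
backward = backward_trajectory(num_frames=25, distance=1.0)
turn_right = turn_right_trajectory(num_frames=25, max_yaw_deg=30.0)
```

In both cases the first pose is the identity, so the trajectory starts at the input view. A different axis convention (for example OpenGL-style, with -z forward) would flip the signs used above.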
