GenLit: Reformulating Single-Image Relighting as Video Generation

Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, Michael J. Black

TL;DR

GenLit reframes single-image relighting as controllable video synthesis: the scene geometry stays static while the lighting is animated by a moving point light. It fine-tunes a controllable version of Stable Video Diffusion, conditioned on a 5D lighting vector, to synthesize relit sequences without an explicit inverse-rendering pipeline. A new synthetic dataset, Objaverse-GenLit (1436 assets), enables training across single- and multi-object scenes, and the model is shown to generalize to real images and to MIT Multi-Illumination data, producing plausible shadows and diffuse inter-reflections. The work highlights the potential of video foundation models to capture light, material, and shape interactions for realistic, controllable relighting without explicit 3D asset reconstruction or ray tracing.

Abstract

Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible: one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. Project page: https://genlit.is.tue.mpg.de/.

Paper Structure

This paper contains 35 sections, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Overview. We generate synthetic videos with a point light in motion while the scene is static. The first frame is fed to the generative branch (grey) as conditioning input, while per-frame lighting signals (5D vector) are provided as global information to the control branch (green).
  • Figure 2: Examples of light trajectories for Single-Object (Top) and Multi-Object--Flying-Light (Bottom)
  • Figure 3: Object-Level: Qualitative Evaluation. Baseline comparisons with WS-SIR, Neural Gaffer, IC-Light, DiLightNet, and Diffusion Renderer. We provide ground-truth rendered backgrounds for IC-Light and Neural Gaffer.
  • Figure 4: Object-Level: Qualitative Evaluation. These results demonstrate generalization to real images by a model trained on the synthetic Single-Object dataset. The first row shows input images from Real-Data--Single-Object, and the last column shows the reference light position. Note: the camera angle differs considerably between images and does not exactly match the reference, so the reference should be read as an approximate guide rather than an exact target. The results show that GenLit can insert point lights very close to the object, casting shadows that are plausible for the given light position.
  • Figure 5: Quantitative Evaluation: Perceptual Study. Each violin plot shows the distribution of Likert scores (1 = not realistic, 5 = very realistic) from the studies conducted on Real-Data--Multi-Object, the MIT Multi-Illumination test set, and Real-Data--Single-Object, respectively. Across all three studies, both the mean and the median are close to 4 and the distributions are narrow, indicating that participants consistently perceived the rendered results as "realistic".
  • ...and 12 more figures
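
To make the conditioning described in the Figure 1 caption more concrete, the sketch below builds one 5D lighting vector per frame for a point light moving around a static scene. It is an illustration only, not the authors' code: the page does not say what the five components are, so the layout used here (3D position, intensity, light size), the circular trajectory, and the function name `circular_light_trajectory` are all assumptions.

```python
import math
from typing import List, Tuple

# Hypothetical layout of the 5D lighting vector (an assumption, not from the paper):
# (x, y, z, intensity, light_size)
LightVector = Tuple[float, float, float, float, float]


def circular_light_trajectory(
    num_frames: int,
    orbit_radius: float = 1.0,
    height: float = 0.5,
    intensity: float = 1.0,
    light_size: float = 0.1,
) -> List[LightVector]:
    """Return one 5D lighting vector per video frame.

    The light moves on a horizontal circle around the origin while the
    scene itself is assumed to stay static, so only the light position
    changes from frame to frame.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    trajectory: List[LightVector] = []
    for frame_index in range(num_frames):
        # Spread the frames evenly over one full revolution.
        angle = 2.0 * math.pi * frame_index / num_frames
        x = orbit_radius * math.cos(angle)
        y = orbit_radius * math.sin(angle)
        trajectory.append((x, y, height, intensity, light_size))
    return trajectory


if __name__ == "__main__":
    # One vector per frame; a real pipeline would pass these to the
    # control branch alongside the single conditioning image.
    for vector in circular_light_trajectory(num_frames=4):
        print(vector)
```

In a setup like the one in Figure 1, the single input image would go to the generative branch once, while a list such as this would supply the per-frame global lighting signal to the control branch.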