GenLit: Reformulating Single-Image Relighting as Video Generation
Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, Michael J. Black
TL;DR
GenLit reframes single-image relighting as controllable video synthesis: the scene geometry stays static while the lighting is animated by a moving point light. It fine-tunes a controllable version of Stable Video Diffusion, conditioned on a 5D lighting vector, to synthesize relit sequences without an explicit inverse-rendering pipeline. A new synthetic dataset, Objaverse-GenLit (1436 assets), supports training on single- and multi-object scenes. The model generalizes to real images and to MIT Multi-Illumination data, producing plausible shadows and diffuse inter-reflections. The work highlights the potential of video foundation models to capture interactions between light, material, and shape, enabling realistic, controllable relighting without explicit 3D asset reconstruction or ray tracing.
Abstract
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible: one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. Project page: https://genlit.is.tue.mpg.de/.
