DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

Chong Zeng, Yue Dong, Pieter Peers, Youkang Kong, Hongzhi Wu, Xin Tong

TL;DR

DiLightNet adds fine-grained lighting control to diffusion-based image generation by guiding the sampling process with radiance hints derived from a coarse foreground geometry. The method is a three-stage pipeline: (1) generate a provisional image under uncontrolled lighting; (2) compute radiance hints on a coarse shape inferred from that image, multiply them with a neural encoding of the provisional image to retain texture detail, and resynthesize the foreground with a radiance-hint-conditioned ControlNet (DiLightNet); and (3) inpaint the background to match the target lighting. Training relies on a large synthetic dataset with diverse shapes, materials, and lighting, which makes the per-pixel radiance-hint conditioning robust. The approach gives consistent lighting control across prompts and lighting conditions and lets the user steer appearance through appearance-seeds and material prompts. Ablations highlight the importance of the radiance-hint encoding, masking, and augmentation. The work enables practical, flexible lighting design in text-to-image generation, with potential extensions to material estimation and text-to-3D generation.

Abstract

This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the necessary expressive power to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three-stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional synthesized image before passing it to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting-controlled diffusion model on a variety of text prompts and lighting conditions.
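
To make the notion of a radiance hint concrete, the following is a minimal sketch, not the authors' renderer. It assumes the coarse shape is given as a depth map treated as a height field facing the viewer, uses a single directional light, and shades with a purely diffuse (Lambertian) canonical material. The function names `normals_from_depth` and `diffuse_radiance_hint`, the `albedo` value, and the lighting model are all illustrative assumptions.

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel unit normals from a depth map via finite differences.

    Assumption: depth is treated as a height field z(x, y) facing the viewer.
    """
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # Unnormalized normal of the surface z = depth(x, y).
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def diffuse_radiance_hint(depth, mask, light_dir, albedo=0.8):
    """Shade the coarse shape with one uniform Lambertian material.

    Hypothetical helper: a single directional light stands in for the
    target lighting; the foreground mask zeroes out the background.
    """
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    n = normals_from_depth(depth)
    shading = np.clip(n @ l, 0.0, None)   # clamped cosine term
    hint = albedo * shading * mask        # zero outside the foreground
    return np.repeat(hint[..., None], 3, axis=-1)  # gray RGB image

# Usage: a flat patch lit head-on is uniformly bright.
depth = np.zeros((4, 4))
mask = np.ones((4, 4))
hint = diffuse_radiance_hint(depth, mask, light_dir=(0.0, 0.0, 1.0))
```

Because the hint only needs to point the diffusion model in the right direction, a crude shape estimate and a simple shading model like this one convey the intent; a full implementation would render the canonical material under the actual target lighting.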

Paper Structure (36 sections, 23 figures, 2 tables)

Figures (23)

  • Figure 1: Examples of generated images specified via a text-prompt (listed below each example) and with fine-grained lighting control. Each prompt is plausibly visualized under two different user-provided lighting environments.
  • Figure 2: Examples of lighting bias in diffusion-based image generation. Left: a batch of $16$ images (text prompt: "a photo of a soccer ball"). The majority of the images are lit by a flash light; only two exhibit off-center lighting (3rd row, 1st column and 3rd column). Right: a batch of generated images of a robot dominated by light coming from either the front-left or front-right (text prompt: "a photo of a toy robot standing on a wooden table"; images are generated with a depth conditioned model to ensure a consistent shape).
  • Figure 3: Overview of our pipeline for lighting-controlled prompt-driven image synthesis: (1) We start by generating a provisional image using a pretrained diffusion model under uncontrolled lighting given a text prompt and a content-seed. (2) Next, we pass an appearance-seed, the provisional image, and a set of radiance hints (computed from the target lighting and a coarse estimate of the depth) to DiLightNet, which resynthesizes the image such that it becomes consistent with the target lighting while retaining the content of the provisional image. (3) Finally, we inpaint the background to be consistent with the foreground object and the target lighting.
  • Figure 4: Provisional image encoder architecture. The output of the encoder is channel-wise multiplied with the radiance hints before passing the resulting $12$-channel feature map to a ControlNet.
  • Figure 5: Text-to-image generated results with lighting control. The first column shows the provisional image as a reference, whereas the last five columns are generated under different user-specified lighting conditions: point lighting (columns 2-3) and environment lighting (columns 4-6). The provisional images for the last two examples are generated with DALL-E 3 instead of Stable Diffusion v2.1 to better handle the more complex prompt.
  • ...and 18 more figures
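
The channel-wise multiplication described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the 12 channels are assumed to come from four RGB radiance hints, the encoder output is assumed to have the same 12-channel shape, random arrays stand in for real features, and `condition_map` is a hypothetical helper name.

```python
import numpy as np

def condition_map(encoded_image, radiance_hints):
    """Multiply encoder features with stacked radiance hints, element by element.

    Hypothetical helper: encoded_image is (H, W, C) and radiance_hints is a
    list of (H, W, 3) arrays whose channels sum to C.
    """
    hints = np.concatenate(radiance_hints, axis=-1)  # (H, W, 3 * num_hints)
    if encoded_image.shape != hints.shape:
        raise ValueError(
            f"shape mismatch: features {encoded_image.shape} vs hints {hints.shape}"
        )
    return encoded_image * hints

rng = np.random.default_rng(0)
H, W = 8, 8
hints = [rng.random((H, W, 3)) for _ in range(4)]  # assumed: four RGB hints
features = rng.random((H, W, 12))                  # stand-in for the encoder output
cond = condition_map(features, hints)
print(cond.shape)  # (8, 8, 12)
```

Multiplying rather than concatenating means that pixels where a hint is dark contribute little texture signal, so the conditioning map carries both the target lighting and the provisional image's detail in a single 12-channel input.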