LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Andreas Dengel

Abstract

Diffusion models have demonstrated high-quality performance in conditional text-to-image generation, particularly with structural cues such as edges, layouts, and depth. However, lighting conditions have received limited attention and remain difficult to control within the generative process. Existing methods handle lighting through a two-stage pipeline that relights images after generation, which is inefficient. Moreover, they rely on fine-tuning with large datasets and heavy computation, limiting their adaptability to new models and tasks. To address this, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process to guide image generation with text prompts and user-specified light directions. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency, while preserving image quality and text alignment. This approach introduces new possibilities for dynamic, user-guided light control. Furthermore, it integrates seamlessly with models like ControlNet, demonstrating adaptability across diverse scenarios.

Paper Structure (20 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Existing prompt-engineering methods produce nearly identical images with or without light-specific prompts, so their outputs overlook the specified lighting conditions. Our proposed LGTM effectively guides lighting during image generation, ensuring outputs align with text prompts and desired lighting directions without fine-tuning.
  • Figure 2: Overview of our proposed LGTM. The user inputs a prompt $p$ and a light condition $l$. The Light Conditional Generation module generates the light direction mask $m_l$ according to $l$, which is used to manipulate the initial noise in Stable Diffusion. Then, the vanilla Stable Diffusion model integrates these inputs, dynamically adjusting the latent space (particularly channel 1) to reflect user-defined lighting conditions. Finally, it outputs the final image $I$.
  • Figure 3: Channel-wise sensitivity analysis via scaling the initial latent noise. We scale a single latent channel $z_T^{(c)}$ ($c\in\{1,2,3,4\}$) by a constant factor $\alpha$, while keeping the prompt, random seed, and the other channels fixed. Scaling channel 1 consistently induces global brightness changes and alters the perceived illumination direction, whereas scaling channels 2-4 mainly affects chromatic attributes with limited impact on lighting.
  • Figure 4: Qualitative Results. The existing model fails to control lighting conditions, often generating images with random or inconsistent lighting. In contrast, our approach effectively incorporates user-specified light direction and intensity, producing more natural and coherent lighting effects in the generated images.
  • Figure 5: Qualitative Results in Application. Existing models integrated with ControlNet (Zhang et al., 2023) control edges but fail to handle lighting, often producing inconsistent results. Our approach combines user-specified light direction and intensity with edge control, generating images with natural lighting and precise structure, demonstrating versatility in handling multiple controls.
  • ...and 1 more figures
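
The channel-wise scaling probe described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the latent shape (4, 64, 64), the seed, the $\alpha$ values, and the helper names `sample_initial_noise` and `scale_channel` are all hypothetical choices made for this example. The sketch only prepares the modified initial noise $z_T$; passing each variant through a diffusion pipeline is left out.

```python
import numpy as np


def sample_initial_noise(seed, channels=4, height=64, width=64):
    """Draw z_T ~ N(0, I) from a fixed seed so every variant starts from the same noise.

    The (4, 64, 64) default shape is an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((channels, height, width)).astype(np.float32)


def scale_channel(z_T, channel, alpha):
    """Return a copy of z_T with one channel scaled by alpha.

    `channel` is 1-indexed to match the caption's c in {1, 2, 3, 4};
    all other channels are left untouched.
    """
    if not 1 <= channel <= z_T.shape[0]:
        raise ValueError(f"channel must be in 1..{z_T.shape[0]}, got {channel}")
    scaled = z_T.copy()
    scaled[channel - 1] *= alpha
    return scaled


# Build one variant per (channel, alpha) pair; the alpha grid is hypothetical.
z_T = sample_initial_noise(seed=0)
variants = {
    (c, alpha): scale_channel(z_T, c, alpha)
    for c in (1, 2, 3, 4)
    for alpha in (0.5, 1.0, 1.5)
}
```

Each entry of `variants` would then replace the default initial latent in an otherwise unchanged generation run (same prompt, same sampler settings), so that any visual difference can be attributed to the scaled channel alone.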