LumiCtrl : Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models

Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral, Joost Van De Weijer

TL;DR

LumiCtrl addresses the lack of explicit illuminant control in text-to-image diffusion by learning illuminant prompts from a single image. It integrates physics-based Planckian augmentation, edge-guided prompt disentanglement via a frozen ControlNet, and a foreground-focused masked reconstruction loss to achieve contextual light adaptation. The approach diagnoses a semantic gap in illuminant grounding within text encoders and demonstrates superior illuminant fidelity, aesthetic quality, and scene coherence compared with prior personalization and editing methods, backed by a user study. This work enables precise, content-preserving lighting control in personalized T2I generation, with potential for broader applications in design and visual storytelling.

Abstract

Current text-to-image (T2I) models have demonstrated remarkable progress in creative image generation, yet they still lack precise control over scene illuminants, which is a crucial factor for content designers aiming to manipulate the mood, atmosphere, and visual aesthetics of generated images. In this paper, we present an illuminant personalization method named LumiCtrl that learns an illuminant prompt given a single image of an object. LumiCtrl consists of three basic components: given an image of the object, our method applies (a) physics-based illuminant augmentation along the Planckian locus to create fine-tuning variants under standard illuminants; (b) edge-guided prompt disentanglement using a frozen ControlNet to ensure prompts focus on illumination rather than structure; and (c) a masked reconstruction loss that focuses learning on the foreground object while allowing the background to adapt contextually, enabling what we call contextual light adaptation. We qualitatively and quantitatively compare LumiCtrl against other T2I customization methods. The results show that our method achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. A human preference study further confirms strong user preference for LumiCtrl outputs. The code and data will be released upon publication.
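To make component (c) concrete, the following is a minimal sketch of a masked reconstruction loss. The abstract does not give the exact formulation, so it assumes a plain squared error averaged over foreground pixels only. The function name `masked_mse`, the nested-list image layout, and the `eps` parameter are illustrative choices, not taken from the paper.

```python
def masked_mse(pred, target, mask, eps=1e-8):
    """Squared error averaged over foreground pixels only.

    pred, target: 2D nested lists of floats (hypothetical layout, e.g. predicted
    and reference values at each spatial location).
    mask: 2D nested list with 1 for foreground and 0 for background.
    Assumption: the paper's actual loss may weight or normalize differently.
    """
    weighted_error = 0.0
    foreground_count = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            weighted_error += m * (p - t) ** 2  # background terms are zeroed out
            foreground_count += m
    # eps avoids division by zero when the mask is empty
    return weighted_error / (foreground_count + eps)


# Toy example: the large background error at the bottom-right pixel is ignored.
pred = [[1.0, 0.0], [0.0, 5.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
foreground_only = [[1, 0], [0, 0]]
everything = [[1, 1], [1, 1]]

loss_fg = masked_mse(pred, target, foreground_only)
loss_all = masked_mse(pred, target, everything)
```

Because background pixels contribute nothing to this loss, the background is left unconstrained by the reconstruction term, which is consistent with the contextual light adaptation described above.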

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Analyzing the capability of T2I generative models. (a) Stable Diffusion fails to generate scenes under text-guided illumination presets, often producing no meaningful lighting change. (b) T2I personalization methods predominantly preserve the lighting of their training examples and fail to generate concepts under other illuminants. (c) Flat Light Adaptation loses photorealism due to unnatural color intensity. (d) IC-Light can adjust lighting direction but struggles with spatial consistency and often produces unnatural shadows or color casts. The central heatmaps quantify the MSE (see the embedding analysis subsection). All methods exhibit high error, confirming the fundamental challenge of illuminant control in current T2I pipelines.
  • Figure 2: Comparison of illuminant embeddings across ViT-based CLIP models. Points represent embeddings of (i) named illuminants, (ii) Kelvin values, (iii) general light names, and (iv) generic numerals. Named illuminants and Kelvin temperatures fail to cluster with semantically related lighting terms: "tungsten" is distant from "warm", and "2850K" clusters with generic numbers rather than lighting concepts.
  • Figure 3: Silhouette scores measuring the separability of illuminant-related embedding clusters across CLIP-based text encoders (ViT-B/32, ViT-L/14, ViT-H/14, ViT-g/14) at both token and sentence levels. Higher scores indicate better-defined, more disentangled clusters. The results reveal consistently low or negative scores for groupings that should be semantically aligned (such as named illuminants and their descriptive equivalents or physical Kelvin values), demonstrating a lack of semantic coherence. Moreover, Kelvin values exhibit high silhouette scores when grouped with generic numerals, confirming that they are interpreted as plain numbers rather than photometric lighting cues.
  • Figure 4: Overview. LumiCtrl consists of three components. First, given an image and a text prompt, our method augments the image under daylight illuminants using physics-based color augmentation to learn embeddings. Second, we introduce text tokens to learn illuminant representations; during training, we optimize only the key and value projection matrices in the diffusion model's cross-attention layers, along with the modifier tokens, and employ ControlNet for Edge-Guided Prompt Disentanglement. Third, we introduce a masked reconstruction loss that focuses learning on the foreground. At inference time, ControlNet is discarded.
  • Figure 5: Qualitative results of LumiCtrl relighting real and T2I-generated concepts from text prompts under three settings: (a) Portrait, (b) Indoor, and (c) Outdoor illumination.
  • ...and 4 more figures
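
The silhouette analysis in the Figure 3 caption can be illustrated with a small, self-contained sketch. It computes the standard mean silhouette coefficient over labeled points. The Euclidean distance, the function name `silhouette_score`, and the toy 2D points are assumptions made for illustration; the paper's embeddings come from CLIP text encoders, and its distance metric is not stated in this section.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For each point: a = mean distance to the other members of its own group,
    b = smallest mean distance to any other group, s = (b - a) / max(a, b).
    Assumption: Euclidean distance; the paper's metric may differ.
    """
    n = len(points)
    scores = []
    for i in range(n):
        distances_by_label = {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            distances_by_label.setdefault(labels[j], []).append(d)
        own = distances_by_label.pop(labels[i], [])
        if not own or not distances_by_label:
            # Singleton group or only one group present: score defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(ds) / len(ds) for ds in distances_by_label.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Toy stand-ins for embeddings: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

coherent = silhouette_score(points, [0, 0, 1, 1])  # grouping matches geometry
mismatched = silhouette_score(points, [0, 1, 0, 1])  # grouping cuts across it
```

In this toy setup the grouping that matches the geometry scores close to 1, while the grouping that cuts across it scores below 0. This mirrors how the caption reads low or negative scores as evidence that a grouping is not reflected in the embedding space.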