Learning Latent Proxies for Controllable Single-Image Relighting

Haoze Zheng, Zihao Wang, Xianfeng Wu, Yajing Bai, Yexin Liu, Yun Li, Xiaogang Xu, Harry Yang

Abstract

Single-image relighting is highly under-constrained: small illumination changes can produce large, nonlinear variations in shading, shadows, and specularities, while geometry and materials remain unobserved. Existing diffusion-based approaches either rely on intrinsic or G-buffer pipelines that require dense and fragile supervision, or operate purely in latent space without physical grounding, making fine-grained control of direction, intensity, and color unreliable. We observe that a full intrinsic decomposition is unnecessary for accurate relighting. Instead, sparse but physically meaningful cues, indicating where illumination should change and how materials should respond, are sufficient to guide a diffusion model. Based on this insight, we introduce LightCtrl, which integrates physical priors at two levels: a few-shot latent proxy encoder that extracts compact material-geometry cues from limited PBR supervision, and a lighting-aware mask that identifies illumination-sensitive regions and steers the denoiser toward shading-relevant pixels. To compensate for scarce PBR data, we refine the proxy branch using a DPO-based objective that enforces physical consistency in the predicted cues. We also present ScaLight, a large-scale object-level dataset with systematically varied illumination and complete camera-light metadata, enabling physically consistent and controllable training. Across object- and scene-level benchmarks, our method achieves photometrically faithful relighting with accurate continuous control, surpassing prior diffusion- and intrinsic-based baselines, including gains of up to +2.4 dB PSNR and 35% lower RMSE under controlled lighting shifts.
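
The abstract reports results in PSNR and RMSE. As a reminder of how these two metrics are conventionally defined, here is a minimal sketch in plain Python. The function names `rmse` and `psnr`, the flat-list pixel representation, and the `max_value=1.0` intensity range are assumptions made for illustration; the paper's actual evaluation protocol (value range, masking, color space) is not specified in this section.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between two equally sized pixel sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 20 * log10(max_value / RMSE).

    max_value is the assumed peak intensity (1.0 for images scaled to [0, 1]).
    """
    err = rmse(pred, target)
    if err == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(max_value / err)
```

Because PSNR is a logarithmic function of RMSE, the two numbers move together when computed on the same images, but gains quoted as "up to" may come from different benchmarks or settings.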

Paper Structure (28 sections, 18 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Our method enables precise and continuous control over illumination, including light direction (e.g., $60^\circ$ down or right), intensity (e.g., dimmed by 200 lux or brightened by 50 lux), and color temperature (e.g., shifted to 1800K or by +2000K). Given only a single source image, the model produces physically consistent relighting while preserving fine textures, specular highlights, and material appearance. The examples illustrate how directional, photometric, and chromatic adjustments are faithfully reflected in the output. (Note: The bottom-left image was captured with a mobile phone and cropped before processing.)
  • Figure 2: Overview of LightCtrl and the object-level rendering pipeline. Our model integrates a few-shot latent proxy, a lighting-aware mask, and a DPO-refined PBR encoder to condition the UNet during denoising. The implicit PBR encoder extracts physically-based latent proxy features $t_{\mathrm{phys}}$, which are concatenated with image tokens $t_{\mathrm{image}}$ and light condition tokens $t_{\mathrm{light}}$ to guide the diffusion process. A lighting mask predictor takes the lighting change $\Delta\ell$ as input to produce a spatially-aware attention mask for the UNet. The output column shows relit results under three different lighting configurations ($\Delta\ell_1$, $\Delta\ell_2$, $\Delta\ell_3$), demonstrating the model's ability to produce diverse and physically plausible relighting effects from a single input image.
  • Figure 3: Visual comparison of intrinsic decomposition on a chair object. From left to right: ground truth (GT), our method with DPO, our method without DPO, RGB-X [Zeng et al., 2024], and Neural LightRig [He et al., 2024]. The DPO fine-tuning clearly suppresses artifacts and produces more accurate albedo and normal predictions, demonstrating the effectiveness of our DPO strategy.
  • Figure 4: Qualitative results on in-the-wild images. Our model produces realistic relighting on Internet and product photos (right). Hand-captured examples confirm robust generalization to uncontrolled lighting.
  • Figure 5: Object-level relighting comparison. Given a single input image (left), we compare our method against IC-Light, Qwen Image Edit, RGB-X, and DiLightNet. Competing methods often introduce strong color shifts, texture distortions, or inconsistent shading, while our approach produces well-relit results that preserve materials, geometry, and object identity. These examples demonstrate our model’s superior ability to perform physically plausible object-centric relighting.
  • ...and 11 more figures
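
The conditioning described in the Figure 2 caption (proxy tokens $t_{\mathrm{phys}}$ concatenated with $t_{\mathrm{image}}$ and $t_{\mathrm{light}}$, plus a spatial attention mask) can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the helper names `concat_tokens` and `masked_attention_weights`, the token ordering, and in particular the way the mask enters attention (as an additive log-bias before the softmax). The caption does not state how the mask is applied inside the UNet, and the learned mask predictor itself is not modeled.

```python
import math
from typing import List

Vector = List[float]


def concat_tokens(t_image: List[Vector], t_phys: List[Vector], t_light: List[Vector]) -> List[Vector]:
    """Concatenate the three token groups along the sequence axis.

    All tokens must share one embedding width (assumed; ordering is hypothetical).
    """
    tokens = t_image + t_phys + t_light
    widths = {len(tok) for tok in tokens}
    if len(widths) != 1:
        raise ValueError("all tokens must have the same embedding width")
    return tokens


def masked_attention_weights(scores: Vector, mask: Vector, eps: float = 1e-6) -> Vector:
    """Softmax over scores biased by log(mask).

    Positions whose mask value is near 0 receive almost no attention weight.
    This is one common way to apply a soft mask, not necessarily the paper's.
    """
    if len(scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    biased = [s + math.log(max(m, eps)) for s, m in zip(scores, mask)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: two image tokens, one proxy token, one light token, width 2.
condition = concat_tokens([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]], [[0.2, 0.8]])
weights = masked_attention_weights([0.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```

In the toy usage, the third position is masked out, so nearly all attention weight is split between the first two positions.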