UniLight: A Unified Representation for Lighting

Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre

TL;DR

UniLight introduces a unified latent lighting representation that bridges text, images, irradiance, and environment maps through modality-specific encoders trained with a cross-modal contrastive objective and an auxiliary spherical-harmonics loss. The approach relies on a compact fusion module to produce a shared embedding, enabling cross-modal retrieval, environment-map generation, and light-controlled diffusion-based synthesis. A multi-modal dataset with aligned modalities supports robust training and evaluation across lighting tasks. The results demonstrate transferable, directional lighting understanding and practical control for lighting-aware image synthesis and editing.

Abstract

Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.
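
The abstract states that the modality-specific encoders are trained contrastively but does not give the loss here. As a minimal sketch only, the following assumes a CLIP-style symmetric InfoNCE loss between two batches of embeddings, where row i of each batch describes the same lighting. The function names and the temperature value are illustrative assumptions, not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] against all targets.

    Assumption: a standard InfoNCE formulation; the temperature is a placeholder.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(emb_a, emb_b, temperature)
                  + info_nce(emb_b, emb_a, temperature))

# Toy embeddings standing in for two modalities (e.g. image and text).
emb_image = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
emb_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = symmetric_contrastive_loss(emb_image, emb_text)
# Cyclically shifting one batch breaks the pairing, so the loss should rise.
shuffled = symmetric_contrastive_loss(emb_image, emb_text[1:] + emb_text[:1])
```

With correctly paired rows the loss is close to zero, and it grows once the pairing is broken, which is the signal that pulls matching lighting descriptions together in a shared space.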

Paper Structure

This paper contains 26 sections, 1 equation, 9 figures, 1 table.

Figures (9)

  • Figure 1: Dataset creation. Starting from an environment map, we extract 9 images and use Prism [dirik2025prism] to estimate their intrinsics. The images further serve as input to DiffusionLight-Turbo [Chinchuthakun2025DiffusionLightTurbo] for environment-map estimation and to a VLM [wang2025internvl3] to produce a text description (see the Dataset section for details).
  • Figure 2: Overview of our embedding approach. Image- and text-based lighting modalities (see the Encoders section) are first embedded using DINOv2 and Qwen3, respectively. All modalities are then processed by lightweight fusion modules, which are trained contrastively to output into our joint latent space, UniLight. To improve latent-space coherence, a linear-probing head estimates spherical-harmonics (SH) coefficients from the latents, and a dedicated loss aligns these coefficients to ground-truth coefficients extracted from the environment map.
  • Figure 3: Analysis of light direction encoding in our unified representation. The environment map is rotated about the vertical axis (rotation angle on the x-axis), and the resulting cosine similarity against the original orientation is shown. Similarity decreases with increasing rotation, indicating that the latent features explicitly encode light direction.
  • Figure 4: Cosine similarity (top, higher is better), inverse rank (bottom left, higher is better), and rank (bottom right, lower is better) between different modalities.
  • Figure 5: We visualize the SH coefficients extracted by our SH head (see Figure 2) from the UniLight embeddings of the input modalities (rows 1 and 3). We render the predicted SH to an environment map (rows 2 and 4) and visualize the dominant light direction as a red cross. Note how both the reconstructed environment maps and the dominant light directions align across modalities as well as with the reference, indicating that they indeed have similar latent embeddings.
  • ...and 4 more figures
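
The metrics named in the Figure 4 caption (cosine similarity, rank, inverse rank) can be illustrated with a small sketch. It assumes the common retrieval setup in which query i from one modality should retrieve gallery item i from another. The helper names and toy vectors are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieval_rank(query, gallery, true_index):
    """1-based rank of gallery[true_index] when sorted by similarity to query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    # Every gallery item strictly more similar than the true match pushes it down.
    return 1 + sum(1 for s in sims if s > sims[true_index])

def evaluate(queries, gallery):
    """Return (mean rank, mean inverse rank), assuming queries[i] matches gallery[i]."""
    ranks = [retrieval_rank(q, gallery, i) for i, q in enumerate(queries)]
    mean_rank = sum(ranks) / len(ranks)
    mean_inverse_rank = sum(1.0 / r for r in ranks) / len(ranks)
    return mean_rank, mean_inverse_rank

# Toy 2-D embeddings standing in for two modalities.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.2]]

mean_rank, mean_inverse_rank = evaluate(queries, gallery)
```

In this toy case the first two queries retrieve their match at rank 1 and the third at rank 2. A lower mean rank and a higher mean inverse rank both indicate better cross-modal alignment, matching the "lower is better" and "higher is better" notes in the caption.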