SAIL: Self-supervised Albedo Estimation from Real Images with a Latent Diffusion Model

Hala Djeghim, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Céline Loscos, Désiré Sidibé

TL;DR

SAIL tackles intrinsic image decomposition for real-world images by producing albedo-like representations via a latent-diffusion prior, trained on unlabeled multi-illumination data. It represents each image latent as $z_i = z^A + z_i^E$, where $z^A$ is the albedo latent (decoded to the albedo image $A$) and $z_i^E$ encodes the lighting of illumination $i$, and it optimizes an unconditioned relighting objective together with latent-space regularizers. The framework operates entirely in the latent space with a diffusion-based decoder, enabling robust albedo estimation as well as downstream relighting and appearance editing without labeled data. Empirical results on MIDIntrinsics and in-the-wild datasets show improved albedo consistency and competitive performance relative to both supervised and self-supervised baselines, demonstrating practical utility for real-world intrinsic decomposition and relighting tasks.
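The additive latent decomposition above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the latent shape, the function names (`lighting_residual`, `albedo_consistency`, `relight`), and the use of a mean squared error for the albedo term are all assumptions made for illustration, and an oracle albedo latent stands in for the network prediction.

```python
import numpy as np

def lighting_residual(z, z_albedo):
    """Lighting latent implied by z = z_albedo + z_light (hypothetical helper)."""
    return z - z_albedo

def albedo_consistency(z_albedo_i, z_albedo_j):
    """Assumed MSE penalty between albedo latents of two illuminations."""
    return float(np.mean((z_albedo_i - z_albedo_j) ** 2))

def relight(z_albedo, z_light_target):
    """Recompose a latent from an albedo latent and a target lighting latent."""
    return z_albedo + z_light_target

rng = np.random.default_rng(0)
shape = (4, 8, 8)  # hypothetical latent shape (channels, height, width)

# One scene (shared albedo latent) observed under two illuminations i and j.
z_A = rng.normal(size=shape)
z_i = z_A + 0.1 * rng.normal(size=shape)
z_j = z_A + 0.1 * rng.normal(size=shape)

# An ideal predictor returns the same albedo latent for both inputs.
loss_ideal = albedo_consistency(z_A, z_A)

# A predictor that leaks lighting into the albedo is penalized.
loss_leaky = albedo_consistency(z_A, z_A + 0.5 * lighting_residual(z_j, z_A))

# Relighting i -> j: keep the albedo latent, swap in the lighting latent of j.
z_i_to_j = relight(z_A, lighting_residual(z_j, z_A))
```

In this toy setting the relit latent `z_i_to_j` matches `z_j` up to floating-point error, which is the property a relighting surrogate objective can exploit: if the albedo latent truly carries no lighting, swapping the lighting component is enough to move between illuminations of the same scene.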

Abstract

Intrinsic image decomposition aims at separating an image into its underlying albedo and shading components, isolating the base color from lighting effects to enable downstream applications such as virtual relighting and scene editing. Despite the rise and success of learning-based approaches, intrinsic image decomposition from real-world images remains a significant challenge due to the scarcity of labeled ground-truth data. Most existing solutions rely on synthetic data in supervised setups, limiting their ability to generalize to real-world scenes. Self-supervised methods, on the other hand, often produce albedo maps that contain reflections and lack consistency under different lighting conditions. To address this, we propose SAIL, an approach designed to estimate albedo-like representations from single-view real-world images. We repurpose the prior knowledge of a latent diffusion model for unconditioned scene relighting as a surrogate objective for albedo estimation. To extract the albedo, we introduce a novel intrinsic image decomposition fully formulated in the latent space. To guide the training of our latent diffusion model, we introduce regularization terms that constrain both the lighting-dependent and lighting-independent components of our latent image decomposition. SAIL predicts stable albedo under varying lighting conditions and generalizes to multiple scenes, using only unlabeled multi-illumination data available online.

Paper Structure

This paper contains 37 sections, 10 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Self-Supervised Albedo Estimation from Real Images -- From a single image under real-world lighting conditions, SAIL extracts high-fidelity albedo by repurposing and finetuning a pretrained latent diffusion model (left). The estimated albedo enables downstream tasks such as single-image virtual relighting, demonstrated using Blender [blender] with different environment maps (right).
  • Figure 2: SAIL overview -- Given a single input image, encoded into the latent space using the frozen pre-trained VAE encoder, SAIL estimates an albedo representation in latent space, which, when decoded, produces an albedo image without any lighting effects.
  • Figure 3: SAIL training -- SAIL performs intrinsic image decomposition ($\hat{z}_i^A$, $\hat{z}_i^E$) conditioned on a source latent $z_i$, through a surrogate objective of image relighting ($\mathcal{L}_{\text{relight}}$ and $\mathcal{L}_{\text{consistency}}$). Considering multiple illuminations $i,j$ of the same scene, we constrain the predicted latent albedos extracted from these source latents to be identical ($\mathcal{L}_{\text{albedo}}$).
  • Figure 4: We qualitatively compare the predicted albedos on the MITDataset [murmann19]. We show that SAIL predicts consistent albedos from the same scene under various lighting conditions.
  • Figure 5: We qualitatively compare the predicted albedos on BigTime [li2018learning]. We show that SAIL predicts consistent albedos from the same scene under various lighting conditions.
  • ...and 5 more figures