Unsupervised Intrinsic Image Decomposition with LiDAR Intensity Enhanced Training

Shogo Sato, Takuhiro Kaneko, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida, Akisato Kimura

TL;DR

LIET is a partially-shared model that encodes an image and LiDAR intensity with separate modality-specific encoders but processes them together in shared components to learn shared representations. Using only an image at inference, it achieves IID quality comparable to that of the existing model that requires LiDAR intensity.

Abstract

Unsupervised intrinsic image decomposition (IID) is the process of separating a natural image into albedo and shade without ground-truth albedo or shade. A recent model employing light detection and ranging (LiDAR) intensity demonstrated impressive performance, but its need for LiDAR intensity during inference restricts its practicality. IID models that use only a single image during inference, while matching the IID quality of a model that uses an image plus LiDAR intensity, are therefore highly desirable. To address this challenge, we propose a novel approach that uses only an image during inference while using both an image and LiDAR intensity during training. Specifically, we introduce a partially-shared model that accepts an image and LiDAR intensity separately, each through its own modality-specific encoder, but processes them together in shared components to learn shared representations. In addition, to enhance IID quality, we propose an albedo-alignment loss and image-LiDAR conversion (ILC) paths. The albedo-alignment loss aligns the gray-scale albedo inferred from an image to that inferred from LiDAR intensity; because LiDAR intensity contains no cast shadows, this reduces the cast shadows in the albedo inferred from the image. Furthermore, to translate the input image into albedo and shade styles while keeping the image contents, encoders separate the input image into a style code and a content code. The ILC path mutually translates the image and LiDAR intensity, which share content but differ in style, and thereby helps to clearly separate style from content. Consequently, the proposed model, LIET, achieves IID quality comparable to that of the existing model with LiDAR intensity, while using only an image, without LiDAR intensity, during inference.

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Training/inference schemes and examples of inferred albedos for (a) USI$^3$D liu2020, (b) IID-LI Sato2023, and (c) LIET (our proposed model). USI$^3$D, which uses only a single image for both training and inference, leaves cast shadows in the inferred albedo. IID-LI uses LiDAR intensity during both training and inference, making shadows much less noticeable, but its applicability is restricted because it requires LiDAR intensity even during inference. LIET uses only an image for inference to expand its usage scenarios, and uses both an image and its corresponding LiDAR intensity during training to make shadows less noticeable.
  • Figure 2: Examples of (a) an input image, (b) its gray-scale image, and the corresponding (c) LiDAR intensity and (d) LiDAR depth. The red circle marks a region containing cast shadows and white arrows. Both the shadow and the white arrow are visible in the gray-scale image. LiDAR intensity has no cast shadows but retains the white arrows, since it is calculated from the intensity ratio of the irradiated and reflected light, which is equivalent to albedo at an infrared wavelength. In contrast, LiDAR depth represents the distance to objects, so neither cast shadows nor white arrows appear in it.
  • Figure 3: LIET architecture, including (a) within-domain reconstruction and (b) cross-domain translation. (a) For each domain ${\rm{X}}\in\{\rm{I, L, R, S}\}$ (image ${\rm{I}}$, LiDAR intensity ${\rm{L}}$, albedo ${\rm{R}}$, shade ${\rm{S}}$), an input $x_{\rm{X}}$ is fed into the style encoder $E^p_{\rm{X}}$ and the content encoder $E^c_{\rm{X}}$ to compute the style code $p_{\rm{X}}$ and the content code $c_{\rm{X}}$. These codes are passed to the generator $G_{\rm{X}}$ to reconstruct the input within its own domain. (b) The image-encoder path accepts an image $x_{\rm{I}}$ and infers the image style code $p_{\rm{I}}$ and content code $c_{\rm{I}}$, as in within-domain reconstruction. Subsequently, $p_{\rm{I}}$ is fed into a style mapping function $f_{\rm{I}}$ to yield domain-specific style codes ($p_{\rm{RI}}$, $p_{\rm{SI}}$, $p_{\rm{LI}}$), which the generators ($G_{\rm{R}}, G_{\rm{S}}, G_{\rm{L}}$) use to generate the respective domains. Similarly, the LiDAR-encoder path uses $x_{\rm{L}}$ to infer albedo $x_{\rm{RL}}$, shade $x_{\rm{SL}}$, and image $x_{\rm{IL}}$ through the LiDAR style and content encoders ($E^p_{\rm{L}}, E^c_{\rm{L}}$). The albedo-alignment loss $\mathcal{L}^{\rm{AA}}$ aligns the gray-scale albedo inferred from an image to that inferred from the LiDAR intensity in order to reduce cast shadows.
  • Figure 4: Calculation process of the albedo-alignment loss $\mathcal{L}^{\rm{AA}}$. First, the albedo from an image $x_{\rm{RI}}$ and the albedo from its corresponding LiDAR intensity $x_{\rm{RL}}$ are computed. These albedos are then masked to the points that have LiDAR values and converted to gray scale. Next, instance normalization is applied to align the scales of the two albedos, and the distance between the scaled albedos is calculated. A stop gradient is applied on the LiDAR-encoder path side so that $x_{\rm{RI}}$ is aligned to $x_{\rm{RL}}$ rather than the reverse, since the LiDAR intensity is independent of sunlight conditions. Because they are gray scale, the LiDAR intensity, the scaled albedo from the image, and the scaled albedo from LiDAR intensity are shown with the cividis color map.
  • Figure 5: Examples of inference results obtained from various existing models and LIET (Ours) on the NTT-IID dataset Sato2023. The compared models include Revisiting$^*$ fan2018, IIDWW li2018, UidSequence Lettry2018, USI$^3$D liu2020, and IID-LI Sato2023. Shadows are less noticeable with IID-LI and LIET, while cast shadows are visibly retained by the existing models that do not use LiDAR intensity.
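
The steps listed in the Figure 4 caption (mask to LiDAR points, convert to gray scale, instance-normalize, measure a distance) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the channel-mean gray-scale conversion, the mean absolute difference used as the distance, and the `eps` constant are all assumptions made for illustration. Plain Python has no automatic differentiation, so the stop gradient is only marked in a comment; in a training framework the LiDAR-side albedo would be treated as a constant target.

```python
import math

def to_gray(rgb):
    """Average the three channels (an assumed gray-scale conversion)."""
    r, g, b = rgb
    return (r + g + b) / 3.0

def instance_normalize(values, eps=1e-8):
    """Shift one image's values to zero mean and scale them to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def albedo_alignment_loss(albedo_from_image, albedo_from_lidar, lidar_mask):
    """Hypothetical albedo-alignment loss over flat lists of (r, g, b) pixels.

    lidar_mask holds True where a LiDAR value exists for that pixel.
    """
    # Steps 1 and 2: keep only pixels with LiDAR values, then go to gray scale.
    gray_img = [to_gray(p) for p, m in zip(albedo_from_image, lidar_mask) if m]
    gray_lid = [to_gray(p) for p, m in zip(albedo_from_lidar, lidar_mask) if m]
    if not gray_img:
        return 0.0  # no LiDAR points, nothing to align

    # Step 3: instance normalization removes scale and offset differences.
    norm_img = instance_normalize(gray_img)
    # In an autodiff framework a stop gradient would be applied here, so that
    # only the image-side albedo is pulled toward the LiDAR-side albedo.
    norm_lid = instance_normalize(gray_lid)

    # Step 4: assumed distance, the mean absolute difference.
    return sum(abs(a - b) for a, b in zip(norm_img, norm_lid)) / len(norm_img)

# Toy data: four pixels, the last one has no LiDAR return and is ignored.
mask = [True, True, True, False]
lidar_albedo = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.4, 0.4, 0.4), (0.9, 0.9, 0.9)]
# Same pattern as the LiDAR albedo up to scale and offset (2 * value + 0.05).
clean_albedo = [(0.25, 0.25, 0.25), (0.45, 0.45, 0.45), (0.85, 0.85, 0.85), (0.0, 0.0, 0.0)]
# Second pixel darkened, as a residual cast shadow would do.
shadowed_albedo = [(0.25, 0.25, 0.25), (0.05, 0.05, 0.05), (0.85, 0.85, 0.85), (0.0, 0.0, 0.0)]

loss_clean = albedo_alignment_loss(clean_albedo, lidar_albedo, mask)
loss_shadowed = albedo_alignment_loss(shadowed_albedo, lidar_albedo, mask)
```

In this toy case `loss_clean` is close to zero because instance normalization cancels the scale and offset difference, while `loss_shadowed` is clearly larger because the darkened pixel changes the normalized pattern. That is the behavior the caption relies on: residual cast shadows in the image-side albedo are penalized, while a global brightness difference between the two albedos is not.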