IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin

TL;DR

This work targets intrinsic decomposition from an arbitrary number of views under unconstrained illumination by introducing IDArb, a diffusion-based model built on a cross-view and cross-component attention framework. It leverages the ARB-Objaverse dataset (5.7M synthetic images across 68k models) and an illumination-augmented, view-adaptive training strategy to achieve multi-view-consistent estimates of albedo $\mathbf{A}$, normal $\mathbf{N}$, metallic $\mathbf{M}$, and roughness $\mathbf{R}$, without requiring camera poses. IDArb achieves state-of-the-art performance on synthetic and real data, enabling downstream tasks such as single-image relighting, material editing, and photometric stereo, while also providing priors that improve optimization-based inverse rendering. The approach demonstrates practical potential for realistic 3D content creation and points to ways of handling diverse lighting and view configurations in inverse rendering.

Abstract

Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.

Paper Structure

This paper contains 24 sections, 3 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: IDArb tackles intrinsic decomposition for an arbitrary number of views under unconstrained illumination. Our approach (a) achieves multi-view consistency compared to learning-based methods and (b) effectively disentangles intrinsic components from lighting effects compared to optimization-based methods. Our method enhances a wide range of applications such as image editing, photometric stereo, and 3D reconstruction.
  • Figure 2: Top: Overview of IDArb. Bottom: Illustration of the attention block within the UNet. Our training batch consists of $N$ input images, sampled from $N_v$ viewpoints and $N_i$ illuminations. The latent vector for each image is concatenated with Gaussian noise for denoising. Intrinsic components are divided into three triplets ($D=3$): Albedo, Normal, and Metallic & Roughness. Specific text prompts are used to guide the model toward different intrinsic components. Inside the UNet attention block, we introduce cross-component and cross-view attention modules, which apply attention across components and views to facilitate global information exchange.
  • Figure 3: Overview of the Arb-Objaverse dataset. Our custom dataset features a diverse collection of objects rendered under various lighting conditions, accompanied by their intrinsic components.
  • Figure 4: Qualitative comparison on synthetic data. IDArb demonstrates superior intrinsic estimation compared to all other methods.
  • Figure 5: Qualitative comparison on real-world data.
  • ...and 14 more figures
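
The token grouping described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes identity query/key/value projections, a single head, tiny toy dimensions, and an arbitrary ordering (cross-component first, then cross-view). The function names `attention`, `cross_component_attention`, `cross_view_attention`, and `make_batch` are hypothetical. It only shows which tokens are allowed to exchange information in each module, for a batch of $N = N_v \times N_i$ images and $D = 3$ components.

```python
import math
import random

def attention(tokens):
    """Single-head self-attention with identity projections (illustrative assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)                      # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[c] for w, v in zip(weights, tokens))
                    for c in range(dim)])
    return out

def cross_component_attention(x):
    """x[n][d] holds T tokens. For each image n, attend over all D components jointly."""
    result = []
    for per_image in x:
        T = len(per_image[0])
        seq = [tok for comp in per_image for tok in comp]   # D*T tokens in one sequence
        mixed = attention(seq)
        result.append([mixed[d * T:(d + 1) * T] for d in range(len(per_image))])
    return result

def cross_view_attention(x):
    """For each component d, attend over the tokens of all N images jointly."""
    N, D, T = len(x), len(x[0]), len(x[0][0])
    result = [[None] * D for _ in range(N)]
    for d in range(D):
        seq = [tok for n in range(N) for tok in x[n][d]]    # N*T tokens in one sequence
        mixed = attention(seq)
        for n in range(N):
            result[n][d] = mixed[n * T:(n + 1) * T]
    return result

def make_batch(n_views, n_illums, n_components=3, n_tokens=4, dim=8, seed=0):
    """Random toy latents shaped [N][D][T][dim], with N = n_views * n_illums."""
    rng = random.Random(seed)
    n_images = n_views * n_illums
    return [[[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
             for _ in range(n_components)] for _ in range(n_images)]

x = make_batch(n_views=2, n_illums=2)        # N = 4 images, D = 3 components
y = cross_view_attention(cross_component_attention(x))
```

In this sketch, cross-component attention never mixes information between images, while cross-view attention never mixes information between components; chaining the two lets every token influence every other token in the batch, which matches the "global information exchange" the caption describes.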