MAIR++: Improving Multi-view Attention Inverse Rendering with Implicit Lighting Representation

JunYong Choi, SeokYeong Lee, Haesol Park, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho

TL;DR

MAIR++ advances scene-level inverse rendering by introducing an implicit lighting representation (ILR), a directional attention-based multi-view aggregation module, and an albedo fusion mechanism, built atop an improved depth/geometry initializer (MGNet). It extends the MAIR framework to render realistic lighting and enable material editing and object insertion, demonstrating superior performance on synthetic OpenRooms FF data and robust generalization to unseen real-world scenes. The approach jointly learns per-pixel lighting in ILR, BRDFs, and 3D lighting volumes, achieving more faithful shading, reduced artifacts, and plausible edits compared to prior methods. This work has practical implications for VR/AR applications requiring accurate, controllable scene reconstructions and lighting-aware editing.

Abstract

In this paper, we propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, SVBRDF, and 3D spatially-varying lighting. While multi-view images have been widely used for object-level inverse rendering, scene-level inverse rendering has primarily been studied using single-view images due to the lack of a dataset containing high dynamic range multi-view images with ground-truth geometry, material, and spatially-varying lighting. To improve the quality of scene-level inverse rendering, a novel framework called Multi-view Attention Inverse Rendering (MAIR) was recently introduced. MAIR performs scene-level multi-view inverse rendering by expanding the OpenRooms dataset, designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Although MAIR showed impressive results, its lighting representation is fixed to spherical Gaussians, which limits its ability to render images realistically. Consequently, MAIR cannot be directly used in applications such as material editing. Moreover, its multi-view aggregation networks have difficulties extracting rich features because they only focus on the mean and variance between multi-view features. In this paper, we propose its extended version, called MAIR++. MAIR++ addresses the aforementioned limitations by introducing an implicit lighting representation that accurately captures the lighting conditions of an image while facilitating realistic rendering. Furthermore, we design a directional attention-based multi-view aggregation network to infer more intricate relationships between views. Experimental results show that MAIR++ not only achieves better performance than MAIR and single-view-based methods, but also displays robust performance on unseen real-world scenes.
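The contrast the abstract draws between mean/variance aggregation and attention-based aggregation can be illustrated with a toy sketch. This is not the MAIR or MAIR++ network: the function names, the use of per-view weights in the mean and variance, and the plain scaled dot-product attention (standing in for the paper's directional attention) are all assumptions made for illustration only.

```python
import math


def mean_variance_aggregate(features, weights):
    """Weighted mean and variance across K views for one pixel.

    features: list of K feature vectors (lists of floats), one per view.
    weights:  list of K non-negative per-view weights (hypothetical inputs).
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(features[0])
    mean = [sum(n * f[d] for n, f in zip(norm, features)) for d in range(dim)]
    var = [
        sum(n * (f[d] - mean[d]) ** 2 for n, f in zip(norm, features))
        for d in range(dim)
    ]
    return mean, var


def attention_aggregate(query, keys, values):
    """Generic scaled dot-product attention over K views.

    The target-view query is compared against one key per view, and the
    softmax of the scores weights the per-view values. This is a stand-in
    for, not a reproduction of, the paper's directional attention.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    denom = sum(exps)
    attn = [e / denom for e in exps]
    dim = len(values[0])
    out = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
    return out, attn


if __name__ == "__main__":
    views = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
    print(mean_variance_aggregate(views, [1.0, 1.0, 1.0]))
    print(attention_aggregate([1.0, 0.0], views, views))
```

The first function reduces the views to two fixed statistics, so any relationship between views that is not captured by a mean or a variance is lost. The second lets the weighting depend on the content of each view, which is the general property the attention-based design relies on.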

Paper Structure (22 sections, 38 equations, 18 figures, 8 tables)

Figures (18)

  • Figure 1: Experimental results on unseen real-world data. First row: material editing results that change the albedo of the apple. MAIR++ is the only method that preserves the apple's specularity and edits the material realistically. Second row: inverse rendering results. The single-view-based methods [cis2020, zhu2022montecarlo] rely solely on contextual information, making it difficult to estimate the complex materials and geometry of real-world scenes.
  • Figure 2: Entire pipeline of MAIR. MAIR addresses the difficulty of inverse rendering by splitting the scene components into the normal, direct lighting, material, and spatially-varying light and progressively estimating them.
  • Figure 3: An illustration of MVANet when $K=3$. MVANet creates a value vector by encoding the color, context feature, and specular feature, and uses multi-view weights as attention to create multi-view aggregated features. Since our goal is to obtain the BRDF of the target view (view $1$), only the value vector of the target view is processed in level 2. While MAIR uses only the Mean-Variance Module, MAIR++ aims to infer more complex relationships between multiple views with the DAM.
  • Figure 4: MAIR++'s entire pipeline. Given RGB and MVS depth, MAIR++ performs single-view inverse rendering and infers surface lighting, then refines this estimate using multiple views. Finally, all information is integrated to calculate the 3D volume lighting.
  • Figure 5: Comparison of depth map quality on unseen real-world data.
  • ...and 13 more figures