ARM: Appearance Reconstruction Model for Relightable 3D Generation

Xiang Feng, Chang Yu, Zoubin Bi, Yintong Shang, Feng Gao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang

TL;DR

ARM presents a two-stage framework for relightable 3D generation from sparse views by decoupling geometry from appearance and performing texture synthesis in UV space. It introduces GeoRM for geometry and separate GlossyRM/InstantAlbedo components for appearance, aided by a material prior to robustly decompose lighting and materials. By back-projecting multi-view measurements into a UV atlas and employing a global-receptive-field UV module, ARM achieves richly detailed textures and realistic relighting, surpassing prior image-to-3D methods. Trained on 8 H100 GPUs, ARM demonstrates strong quantitative and qualitative gains across geometry and appearance tasks, including multi-material and relightable scenarios, highlighting its practical impact for games, metaverse assets, and e-commerce.

Abstract

Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.
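The back-projection step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the pinhole camera dictionary, the `project` and `back_project` helpers, and the cosine weighting are all assumptions made for illustration, and occlusion testing is omitted. Each texel is assumed to already carry a 3D surface position and normal (from the unwrapped mesh), and each view contributes either a sampled color with a weight or nothing.

```python
import math

def project(point, cam):
    """Pinhole projection of a world-space point to pixel coordinates.

    `cam` is a hypothetical dict with 'origin', orthonormal axes 'right',
    'up', 'forward', a 'focal' length in pixels, and 'width'/'height'.
    Returns (col, row) or None if the point lies behind the camera.
    """
    d = [p - o for p, o in zip(point, cam["origin"])]
    x = sum(a * b for a, b in zip(d, cam["right"]))
    y = sum(a * b for a, b in zip(d, cam["up"]))
    z = sum(a * b for a, b in zip(d, cam["forward"]))
    if z <= 0:
        return None
    col = cam["focal"] * x / z + cam["width"] / 2
    row = -cam["focal"] * y / z + cam["height"] / 2
    return col, row

def back_project(texels, views):
    """Gather one (color, weight) sample per texel per view.

    texels: list of (position, normal) pairs, one per UV texel.
    views:  list of (cam, image) pairs; image[row][col] is a color tuple.
    A view yields None for a texel that is off-screen or back-facing.
    Occlusion is NOT checked in this sketch.
    """
    result = []
    for pos, nrm in texels:
        per_view = []
        for cam, image in views:
            hit = project(pos, cam)
            sample = None
            if hit is not None:
                c, r = int(math.floor(hit[0])), int(math.floor(hit[1]))
                to_cam = [o - p for o, p in zip(cam["origin"], pos)]
                dist = math.sqrt(sum(t * t for t in to_cam))
                # Cosine between the normal and the direction to the camera.
                cos = sum(n * t for n, t in zip(nrm, to_cam)) / dist
                inside = 0 <= r < cam["height"] and 0 <= c < cam["width"]
                if inside and cos > 0:
                    sample = (image[r][c], cos)
            per_view.append(sample)
        result.append(per_view)
    return result
```

Keeping one entry per view, rather than blending immediately, leaves the fusion of the per-view maps to a downstream module, which is the role the abstract assigns to the UV-space network.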

Paper Structure

This paper contains 25 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: ARM generates high-quality, relightable 3D content from a single image input. This figure presents sample results generated from different input images, demonstrating ARM's ability to reconstruct a variety of objects with spatially-varying appearance. Please refer to our supplementary video for results under dynamic view and lighting.
  • Figure 2: Overview of our pipeline. (left) Starting from sparse-view input images generated by a diffusion model [shi2023zero123++], ARM separates shape and appearance generation into two stages. In the geometry stage, ARM uses GeoRM to predict a 3D shape from the input images. In the appearance stage, ARM employs InstantAlbedo and GlossyRM to reconstruct PBR maps, enabling realistic relighting under varied lighting conditions. (right) Both GeoRM and GlossyRM share the same architecture, consisting of a triplane synthesizer and a decoding MLP. GeoRM is trained to predict density and extracts an iso-surface from the density grid with DiffMC [wei2023neumanifold], while GlossyRM is trained to predict roughness and metalness. GlossyRM is trained after GeoRM and initializes with the weights of GeoRM at the start of training.
  • Figure 3: Overview of InstantAlbedo. InstantAlbedo operates in the texture UV space. This process begins by converting all necessary data to UV texture space. Given the unwrapped mesh from GeoRM, we back-project images, material encodings, and auxiliary data into UV texture space, resulting in six sets of inputs corresponding to the six input views. InstantAlbedo then processes these maps using a U-Net and an inpainting-specific FFC-Net to predict both the lighting-baked color and the decomposed diffuse albedo UV textures.
  • Figure 4: Qualitative comparison. We present examples of single-image 3D generation across different methods. While other methods exhibit blurriness, ARM reconstructs complex patterns with sharp details. Please zoom in to examine the texture quality. Full results, including comparisons with LGM [tang2024lgm] and CRM [wang2024crm], are provided in the supplementary material.
  • Figure 5: PBR comparison. We compare reconstructed PBR maps and relit images under novel lighting to SF3D [boss2024sf3d]. While SF3D produces constant roughness and material with lighting baked into the diffuse color (highlighted in the figure), our method generates spatially-varying appearance, with well-separated illumination and materials. See supplementary material for full results.
  • ...and 5 more figures
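
The Figure 2 caption describes GeoRM and GlossyRM as a triplane synthesizer followed by a decoding MLP. As a rough illustration of how a triplane is typically queried, the sketch below projects a 3D point onto three axis-aligned feature planes, concatenates the sampled features, and decodes them. Everything here is an assumption for illustration: the helper names, nearest-neighbour sampling, concatenation as the aggregation rule, and a single linear layer standing in for the MLP are not taken from the paper.

```python
def sample_plane(plane, u, v):
    """Nearest-neighbour lookup in a square feature grid.

    plane[i][j] is a list of features; u and v are coordinates in [-1, 1].
    """
    res = len(plane)
    i = min(res - 1, max(0, int((v + 1) / 2 * res)))
    j = min(res - 1, max(0, int((u + 1) / 2 * res)))
    return plane[i][j]

def query_triplane(planes, point):
    """Concatenate features sampled from the xy, xz, and yz planes."""
    x, y, z = point
    feats = []
    feats += sample_plane(planes["xy"], x, y)
    feats += sample_plane(planes["xz"], x, z)
    feats += sample_plane(planes["yz"], y, z)
    return feats

def decode(feats, weights, bias):
    """One linear layer as a stand-in for the decoding MLP (assumption)."""
    return [sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(weights, bias)]
```

In this picture, a geometry model would decode a single density value per query point, while an appearance model with the same structure would decode two values (roughness and metalness); only the output dimension of the decoder changes.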