Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark

Zhengfei Kuang, Yunzhi Zhang, Hong-Xing Yu, Samir Agarwala, Shangzhe Wu, Jiajun Wu

TL;DR

Stanford-ORB tackles the real-world evaluation gap for object inverse rendering by introducing a dataset with ground-truth 3D scans, HDR multi-view images, and environment lighting for 14 objects across 7 scenes. It defines three evaluation tasks (geometry estimation, novel scene relighting, and novel view synthesis) and provides a full capture, processing, and pose-registration pipeline. A broad set of baselines across material decomposition, NeRF/IDR-based geometry, and single-view intrinsics is benchmarked, revealing that differentiable Monte Carlo renderers like NVDiffRecMC improve relighting and view synthesis, while explicit geometry representations favor precise shape reconstruction. The work releases data, code, and evaluation protocols, enabling rigorous real-world benchmarking and highlighting remaining gaps in generalizing inverse rendering to complex real-world lighting.

Abstract

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment lighting. Using this dataset, we establish the first comprehensive real-world evaluation benchmark for object inverse rendering tasks from in-the-wild scenes, and compare the performance of various existing methods.

Paper Structure (37 sections, 1 equation, 12 figures, 2 tables)

This paper contains 37 sections, 1 equation, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Data Capture Pipeline Overview. For each object, Left: we obtain its 3D shape using a 3D scanner and Physics-Based Rendering (PBR) materials using high-quality light box images. Middle: we also capture multi-view masked images in 3 different in-the-wild scenes, together with the ground-truth environment maps. Right: we carefully register the camera poses for all images using the scanned mesh and recovered materials, and prepare the data for the evaluation benchmarks. Credit to Maurice Svay [low-poly-camera] for the low-poly camera mesh model.
  • Figure 2: Selection of Objects. From top to bottom: Block, Gnome, Ball, Car; Curry, Pepsi, Salt, Baking, Chips; Cactus, Pitcher, Grogu, Cup, Teapot.
  • Figure 3: Studio Capture Setup. (a) 3D shape scanning. A: object, B: hand-held EinScan Pro HD 3D Scanner, C: printed patterns for camera registration, D: spray for high-quality scanning, E: desktop for processing. The scanned mesh is visualized in ExScan Pro [exscanpro] and MeshLab [meshlab] on the right. (b) Light box capture setup viewed from outside and inside. F: DSLR camera, G: cloth cover to block light from outside, H: object, I: printed patterns for camera registration, J: (optional) dark background for object segmentation, K: remote-controlled turntable. The captured image and the chrome ball are visualized on the right.
  • Figure 4: In-the-Wild Capture Setup. (a) Hardware for capturing. A: chrome ball, B: magenta platform for object segmentation, C: object, D: printed patterns for camera registration, E: DSLR camera, F: cloth for hiding photographer, G: mobile cart. (b) An example of the image-envmap pair. The environment map is solved from the reflection image on the chrome ball.
  • Figure 5: Data Processing Pipeline. Left: Overview of the data processing pipelines for both studio and in-the-wild captures. Three individual modules painted blue are expanded on the right. Middle-Top: The semi-automatic segmentation module produces object masks for all images. Middle-Bottom: Environment maps are solved from the chrome ball images. Right: Accurate camera poses are obtained from COLMAP and refined using NVDiffRec [Munkberg2022nvdiffrec], given the scanned mesh and (for in-the-wild images) the pseudo materials optimized from light box captures.
  • ...and 7 more figures
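
The captions for Figures 4 and 5 state that environment maps are solved from the reflection image on the chrome ball, without giving the procedure. As a rough illustration of the geometry involved, the sketch below maps a point on a mirror-ball image to the direction of the incoming light and then to equirectangular environment-map coordinates. It is not the authors' pipeline: the orthographic camera, the perfect-mirror sphere, the axis conventions, and the function names `ball_pixel_to_direction` and `direction_to_equirect` are all assumptions made for this example.

```python
from typing import Tuple

import numpy as np


def ball_pixel_to_direction(u: float, v: float) -> np.ndarray:
    """Map a point on the chrome-ball image to a reflected direction.

    (u, v) are coordinates inside the ball's silhouette, normalized so the
    silhouette is the unit disk (u to the right, v up). Assumes an
    orthographic camera looking along -z and a perfect mirror sphere.
    The result is a unit vector in the camera frame.
    """
    r2 = u * u + v * v
    if r2 > 1.0:
        raise ValueError("point lies outside the ball silhouette")
    # Surface normal of the unit sphere at the visible point.
    n = np.array([u, v, np.sqrt(1.0 - r2)])
    d = np.array([0.0, 0.0, -1.0])  # viewing ray direction
    # Mirror reflection: r = d - 2 (d . n) n
    return d - 2.0 * np.dot(d, n) * n


def direction_to_equirect(r: np.ndarray) -> Tuple[float, float]:
    """Convert a direction to fractional equirectangular coordinates.

    Assumed convention: +y is up, azimuth is measured from +z toward +x.
    Returns (s, t) in [0, 1], with s across the width and t down the height.
    """
    x, y, z = r / np.linalg.norm(r)
    phi = np.arctan2(x, z)                    # azimuth in [-pi, pi]
    theta = np.arccos(np.clip(y, -1.0, 1.0))  # polar angle from +y
    return float(phi / (2.0 * np.pi) + 0.5), float(theta / np.pi)


if __name__ == "__main__":
    # Sample a few ball pixels and show where they land in the envmap.
    for u, v in [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]:
        direction = ball_pixel_to_direction(u, v)
        print((u, v), direction, direction_to_equirect(direction))
```

Under these assumptions the center of the ball reflects light arriving from the camera's own direction, while points near the rim reflect light from behind the ball. A large part of the environment is therefore compressed into a thin ring of pixels at the rim, which is one reason a practical pipeline needs more care than this per-pixel lookup.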