Geo-ID: Test-Time Geometric Consensus for Cross-View Consistent Intrinsics

Alara Dirik, Stefanos Zafeiriou

Abstract

Intrinsic image decomposition aims to estimate physically based rendering (PBR) parameters such as albedo, roughness, and metallicity from images. While recent methods achieve strong single-view predictions, applying them independently to multiple views of the same scene often yields inconsistent estimates, limiting their use in downstream applications such as editable neural scenes and 3D reconstruction. Video-based models can improve cross-frame consistency but require dense, ordered sequences and substantial compute, limiting their applicability to sparse, unordered image collections. We propose Geo-ID, a novel test-time framework that repurposes pretrained single-view intrinsic predictors to produce cross-view consistent decompositions by coupling independent per-view predictions through sparse geometric correspondences that form uncertainty-aware consensus targets. Geo-ID is model-agnostic, requires no retraining or inverse rendering, and applies directly to off-the-shelf intrinsic predictors. Experiments on synthetic benchmarks and real-world scenes demonstrate substantial improvements in cross-view intrinsic consistency as the number of views increases, while maintaining comparable single-view decomposition performance. We further show that the resulting consistent intrinsics enable coherent appearance editing and relighting in downstream neural scene representations.

Paper Structure (37 sections, 5 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Existing single-view intrinsic predictors estimate PBR parameters independently per view, leading to cross-view inconsistencies. Geo-ID produces consistent decompositions by coupling per-view predictions through geometric correspondences at test time, enabling coherent material editing and relighting in downstream neural scene representations.
  • Figure 2: Overview of Geo-ID. The pipeline consists of three phases. (1) Geometry-guided correspondence estimation: (a) We use a pretrained geometry transformer (VGGT) to predict camera parameters and dense 3D point maps with confidence from the input views. (2) Voxel-based intrinsic consensus: (b) A pretrained single-view diffusion model first produces independent intrinsic predictions for each view. (c) We voxelise high-confidence 3D points and aggregate the corresponding intrinsic values into a robust voxel-level consensus, which we reproject into each image. (3) Consensus-guided diffusion: (d) We run a second diffusion pass per view and inject the view-projected consensus as sparse guidance at selected denoising steps, producing cross-view consistent intrinsic predictions.
  • Figure 3: Qualitative comparison. For two scenes (one indoor MipNeRF-360, one outdoor Tanks & Temples), we show input views alongside albedo, metallicity, and roughness predictions from the base model (RGB$\leftrightarrow$X) applied independently per view and with our Geo-ID guidance. Without guidance, the base model produces plausible but inconsistent decompositions: note the color drift across views on the same surfaces (highlighted regions). Bottom row: zoomed-in crops of corresponding surface regions across two views, showing how Geo-ID reduces cross-view disagreement while preserving fine detail and decomposition quality.
  • Figure 4: Qualitative comparison with single-view, multi-view, and video intrinsic decomposition baselines on the MipNeRF-360 Garden scene. Each row shows the albedo prediction for a different input view. Single-view methods (RGB$\leftrightarrow$X, IDArb) produce detailed but cross-view inconsistent estimates, while the video-based approach (Diffusion Renderer) generates multi-view consistent results but fails to remove lighting effects. Geo-ID (rightmost column, 16-view setting) maintains the detail of single-view predictions while achieving cross-view consistency.
  • Figure 5: Downstream applications. Top: We train MeshSplatting (Held et al., 2025) on Geo-ID-predicted albedo maps and render the reconstructed surface meshes under novel HDR environment maps. Rows correspond to different lighting conditions; columns show rendered views. Bottom: We manually segment target regions on the extracted meshes and modify their albedo to demonstrate material editing.
  • ...and 4 more figures
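
The voxel-based consensus step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the voxel size, the confidence threshold, and the use of a per-voxel median as the "robust" aggregate are all assumptions made for illustration. Points from all views are pooled, low-confidence points are dropped, the remaining points are binned into voxels, and each point is then assigned its voxel's aggregate as a sparse consensus target.

```python
import math
from collections import defaultdict
from statistics import median


def voxel_consensus(points, values, confidences, voxel_size=0.1, min_conf=0.5):
    """Aggregate per-point intrinsic values into one value per voxel.

    points:      (x, y, z) world-space positions pooled over all views
    values:      one scalar intrinsic prediction per point (e.g. roughness)
    confidences: one geometry confidence per point
    voxel_size and min_conf are hypothetical settings, not values from the paper.
    """
    buckets = defaultdict(list)
    keys = []  # voxel index of each point, or None if the point was filtered out
    for (x, y, z), value, conf in zip(points, values, confidences):
        if conf < min_conf:
            keys.append(None)
            continue
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append(value)
        keys.append(key)
    # Assumption: the median stands in for the robust aggregate.
    consensus = {key: median(vals) for key, vals in buckets.items()}
    return consensus, keys


def consensus_targets(consensus, keys):
    """Map each point back to its voxel's consensus value (None where filtered)."""
    return [consensus[key] if key is not None else None for key in keys]


# Toy data: three confident points share a voxel, one point is low-confidence,
# and one confident point sits alone in a distant voxel.
points = [
    (0.01, 0.01, 0.01),
    (0.02, 0.03, 0.04),
    (0.03, 0.01, 0.02),
    (0.04, 0.04, 0.04),
    (1.05, 1.05, 1.05),
]
values = [0.2, 0.4, 0.9, 0.0, 0.7]
confidences = [0.9, 0.8, 0.95, 0.1, 0.9]

consensus, keys = voxel_consensus(points, values, confidences)
targets = consensus_targets(consensus, keys)
print(targets)  # [0.4, 0.4, 0.4, None, 0.7]
```

Because each pooled point originates from a known pixel in a known view, looking up the voxel value per point yields a sparse per-view target map; in this sketch that lookup plays the role of the reprojection step. The median limits the influence of a single outlying view (the 0.9 prediction above), which is one plausible reading of "robust" here.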