GeoFusionLRM: Geometry-Aware Self-Correction for Consistent 3D Reconstruction

Ahmet Burak Yildirim, Tuna Saygin, Duygu Ceylan, Aysegul Dundar

TL;DR

GeoFusionLRM addresses geometric inconsistencies in single-image 3D reconstruction with a geometry-aware self-correction framework. It adds a GeoFormer encoder that processes depth and normal maps rendered from intermediate reconstructions, and a GeoFuser that fuses the resulting geometry-aware tokens with semantic image features to refine subsequent reconstruction passes. The method unrolls three refinement steps during training and reuses existing losses, achieving normal-map fidelity that matches or exceeds the state of the art on the OmniObject3D and Google Scanned Objects datasets. The approach improves the alignment between the reconstructed mesh and the conditioning input views, yielding sharper geometry and better detail preservation without external supervision, at the cost of increased inference time from the extra refinement passes.

Abstract

Single-image 3D reconstruction with large reconstruction models (LRMs) has advanced rapidly, yet reconstructions often exhibit geometric inconsistencies and misaligned details that limit fidelity. We introduce GeoFusionLRM, a geometry-aware self-correction framework that leverages the model's own normal and depth predictions to refine structural accuracy. Unlike prior approaches that rely solely on features extracted from the input image, GeoFusionLRM feeds back geometric cues through a dedicated transformer and fusion module, enabling the model to correct errors and enforce consistency with the conditioning image. This design improves the alignment between the reconstructed mesh and the input views without additional supervision or external signals. Extensive experiments demonstrate that GeoFusionLRM achieves sharper geometry, more consistent normals, and higher fidelity than state-of-the-art LRM baselines.
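
The feedback loop described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the function `refine_reconstruction` and its callable parameters (`reconstruct`, `render_geometry`, `encode_geometry`, `fuse`) are hypothetical names standing in for the LRM backbone, a depth/normal renderer, the GeoFormer, and the GeoFuser respectively. Whether the initial image-only pass counts as one of the three refinement steps is not stated in the summary; the sketch assumes it does not.

```python
from typing import Any, Callable, List, Tuple


def refine_reconstruction(
    image_tokens: Any,
    reconstruct: Callable[[Any], Any],
    render_geometry: Callable[[Any], Tuple[Any, Any]],
    encode_geometry: Callable[[Any, Any], Any],
    fuse: Callable[[Any, Any], Any],
    num_steps: int = 3,
) -> List[Any]:
    """Return the initial reconstruction followed by each refined one.

    All callables are hypothetical placeholders (assumptions):
      reconstruct      -- LRM backbone: conditioning tokens -> mesh
      render_geometry  -- mesh -> (depth maps, normal maps)
      encode_geometry  -- GeoFormer role: (depth, normals) -> geometry tokens
      fuse             -- GeoFuser role: (image tokens, geometry tokens) -> tokens
    """
    # Initial pass conditions on image features only.
    mesh = reconstruct(image_tokens)
    history = [mesh]

    for _ in range(num_steps):
        # Read geometry off the model's own current estimate.
        depth, normals = render_geometry(mesh)
        geometry_tokens = encode_geometry(depth, normals)
        # Always fuse with the original image features, so the
        # conditioning image stays the reference at every step.
        conditioning = fuse(image_tokens, geometry_tokens)
        mesh = reconstruct(conditioning)
        history.append(mesh)

    return history
```

Returning the full history rather than only the final mesh makes it straightforward to inspect quality per iteration, and it makes explicit that each additional pass reruns the reconstruction backbone, which is where the extra inference time comes from.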

Paper Structure (16 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Qualitative comparison using a synthesized input image generated by the FLUX image generator. The same synthesized image is provided as input to the InstantMesh baseline and our proposed GeoFusionLRM. The baseline struggles to preserve geometric fidelity, producing distorted normals and misaligned surface details. In contrast, our iterative geometric conditioning progressively corrects these errors, yielding reconstructions with sharper normals and RGB renderings that more closely match the GT view.
  • Figure 2: Overview of the proposed GeoFusionLRM architecture. Given a conditioning image, semantic features are extracted with a pre-trained vision encoder, while geometric cues from normals and depths of the intermediate mesh are encoded by the geometry-aware GeoFormer. The GeoFuser module merges these two streams of embeddings at the token level to produce refined conditioning features, which guide the LRM in generating an updated 3D mesh. This process corrects residual geometric errors and improves the consistency of surface normals and RGB renderings with respect to the conditioning image.
  • Figure 3: Qualitative results on GSO. Columns show the conditioning input image (left), followed by LRM, SPAR3D, LGM, InstantMesh, and our GeoFusionLRM. For each method, we display results rendered from the same camera viewpoints, showing RGB outputs (top) and surface normals (bottom).
  • Figure 4: Performance across refinement iterations on the OmniObject3D dataset under uniform views.
  • Figure 5: Limitations on thin structure reconstruction. Our refinement improves coarse branches by closing gaps (see zooms), but very thin root segments remain missing due to the limited resolution of the InstantMesh triplane backbone.