
MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

Dongki Jung, Jaehoon Choi, Yonghan Lee, Sungmin Eum, Heesung Kwon, Dinesh Manocha

TL;DR

MoRe introduces a training-free monocular geometry refinement method that achieves cross-view consistency and scale alignment for point maps produced by monocular 3D foundation models. It first performs an affine alignment using inter-view correspondences, then refines the result with a graph-based locally planar optimization that jointly optimizes 3D points and surface normals. This approach preserves the underlying scene structure while mitigating monocular scale ambiguity, and it improves novel-view synthesis in sparse-view regimes. Across standard benchmarks, MoRe achieves competitive multi-view depth and 3D reconstruction performance, with notable gains in cross-view coherence and rendering quality.
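The initial affine alignment step can be illustrated with a small least-squares fit. This is a minimal NumPy sketch, assuming the affine transform is a scalar scale s and shift t applied to monocular depth at matched pixels, and that target depths for those pixels are already available (for example, from a reference view). The function name `fit_scale_shift` is hypothetical, and how MoRe actually forms its targets is not stated in this summary.

```python
import numpy as np


def fit_scale_shift(d_src, d_tgt):
    """Least-squares (s, t) minimizing sum_i (s * d_src[i] + t - d_tgt[i]) ** 2.

    Assumption (hypothetical setup): the affine transform is a scalar scale and
    shift on monocular depth at matched pixels, and d_tgt holds target depths
    for the same pixels.
    """
    d_src = np.asarray(d_src, dtype=float).ravel()
    d_tgt = np.asarray(d_tgt, dtype=float).ravel()
    if d_src.shape != d_tgt.shape or d_src.size < 2:
        raise ValueError("need at least two matched depth pairs of equal length")
    # Design matrix with columns [depth, 1], so A @ [s, t] approximates d_tgt.
    A = np.stack([d_src, np.ones_like(d_src)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_tgt, rcond=None)
    return float(s), float(t)


# Usage on synthetic matches: the true scale is 2.0 and the true shift is 0.5.
rng = np.random.default_rng(0)
d_mono = rng.uniform(1.0, 5.0, size=100)
d_target = 2.0 * d_mono + 0.5
s, t = fit_scale_shift(d_mono, d_target)
mean_residual = np.abs(s * d_mono + t - d_target).mean()
```

On noiseless synthetic data the fit recovers the true scale and shift exactly; with real correspondences a single global (s, t) per view leaves residuals, which is what the subsequent graph-based refinement is meant to reduce.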

Abstract

Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least-squares optimization to these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse-view rendering scenarios.
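The difference between plain least squares on matched points and a local planar approximation can be seen by comparing two residuals. This is a minimal NumPy sketch, assuming the planar term takes the common point-to-plane form n·(p - q), where q is an anchor point and n its estimated surface normal; the function names are hypothetical and the exact MoRe objective is not given in this summary.

```python
import numpy as np


def point_to_point(p, q):
    """Euclidean distance between matched points: the plain least-squares target."""
    return np.linalg.norm(np.asarray(p, float) - np.asarray(q, float), axis=-1)


def point_to_plane(p, q, n_q):
    """Signed distance from p to the local tangent plane through q with normal n_q."""
    p, q, n_q = (np.asarray(a, dtype=float) for a in (p, q, n_q))
    n_q = n_q / np.linalg.norm(n_q, axis=-1, keepdims=True)  # unit normal
    return np.sum((p - q) * n_q, axis=-1)


q = np.array([0.0, 0.0, 1.0])        # anchor point on the surface z = 1
n_q = np.array([0.0, 0.0, 1.0])      # its surface normal
slid = np.array([0.3, 0.0, 1.0])     # match displaced within the tangent plane
lifted = np.array([0.0, 0.0, 1.2])   # match displaced off the surface

print(point_to_point(slid, q), point_to_plane(slid, q, n_q))
print(point_to_point(lifted, q), point_to_plane(lifted, q, n_q))
```

A match that slides within the local plane is penalized by the point-to-point term but not by the point-to-plane term, while a displacement off the surface is penalized by both. This tolerance to in-plane correspondence error is one plausible reason a planar formulation can preserve surface structure better than raw point matching.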

Paper Structure

This paper contains 24 sections, 12 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Monocular geometry estimation often suffers from scale ambiguity across different views, leading to 3D points with inconsistent scales. To address this, we propose MoRe, a monocular geometry refinement method for aligning point maps across views. We first apply an initial affine transformation using matched 3D points, followed by a novel refinement step. Red lines indicate residual distances between corresponding points. During refinement, instead of directly minimizing least squares error over these correspondences, we introduce a graph-based optimization that yields more accurate and consistent 3D reconstructions.
  • Figure 1: Additional Qualitative Results of MoRe
  • Figure 2: Overview of our proposed method. Given input images and camera poses, we first generate monocular point maps and surface normal maps using a 3D foundation model. We then perform initial alignment using 2D feature correspondences and estimate an affine transformation (scale and shift) to roughly align point maps across views. As shown in the Alignment visualization (top right), the initial alignment brings 3D points into a similar position, but residual errors (red lines) still remain. To further improve consistency, we introduce a graph-based optimization that jointly parameterizes 3D points and surface normals to refine alignment at the pixel level. This refinement significantly reduces residuals and improves geometric coherence across views.
  • Figure 2: Additional Qualitative Results of MoRe
  • Figure 3: Illustration of the proposed geometric constraints for the graph optimization. Graph depicts the abstract graph structure, where nodes indicate 3D points and dotted edges indicate geometric constraints. 3D Space shows the corresponding spatial relationships of the graph structure, where colored rectangles represent local tangent planes and small spheres denote 3D points. Each subfigure presents a distinct type of geometric constraint incorporated into our optimization: (a) Enforces local surface smoothness within the same frame by assuming that neighboring 3D points lie on a shared local plane. This regularization is applied within each individual view. (b) Propagates geometric smoothness across frames using 2D point correspondences. Matched points across views are encouraged to lie on a consistent local plane, supporting cross-view surface coherence. (c) Ensures that 3D points lie along the viewing rays of their corresponding pixels. This ray consistency keeps each 3D point consistent with its pixel under reprojection within each frame. (d) Applies local surface smoothness constraints across views using 3D K-nearest neighbors (KNN). This provides additional regularization for corresponding points that the 2D matcher did not detect, helping to achieve more complete alignment.
  • ...and 3 more figures
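The four constraint types described for Figure 3 can be expressed as residuals over a graph whose nodes are 3D points. The following NumPy sketch is an illustration under stated assumptions, not the authors' implementation: all points and camera centers share one world frame, the planar terms use the point-to-plane form n_i·(p_j - p_i), KNN is brute force, the terms are summed with equal weights, and every function name is hypothetical. Constraint (b) would reuse `plane_residuals` with edges from a 2D feature matcher, which is not shown, and the sketch only evaluates the energy rather than optimizing it.

```python
import numpy as np


def plane_residuals(points, normals, edges):
    """For each edge (i, j): n_i . (p_j - p_i), the offset of p_j from i's tangent plane.

    Hypothetical helper shared by constraints (a), (b), and (d); only the edge source differs.
    """
    i, j = edges[:, 0], edges[:, 1]
    n = normals[i] / np.linalg.norm(normals[i], axis=1, keepdims=True)
    return np.sum(n * (points[j] - points[i]), axis=1)


def ray_residuals(points, centers, dirs):
    """Constraint (c): distance of each point from its pixel's viewing ray."""
    r = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    v = points - centers
    # Remove the component along the ray; what remains is the perpendicular offset.
    return np.linalg.norm(v - np.sum(v * r, axis=1, keepdims=True) * r, axis=1)


def grid_edges(h, w):
    """Constraint (a): 4-connected pixel neighbours within one H x W frame."""
    idx = np.arange(h * w).reshape(h, w)
    right = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    down = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    return np.concatenate([right, down], axis=0)


def knn_edges(src, dst, k, src_offset, dst_offset):
    """Constraint (d): brute-force 3D k-nearest neighbours of src points among dst.

    The offsets map view-local indices into the stacked point array.
    """
    d2 = np.sum((src[:, None, :] - dst[None, :, :]) ** 2, axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    i = np.repeat(np.arange(len(src)), k)
    return np.stack([i + src_offset, nn.ravel() + dst_offset], axis=1)


# Toy scene: view A sees the plane z = 2, view B sees it at z = 2.5 (a scale error).
pts_a = np.array([[-1.0, -1.0, 2.0], [1.0, -1.0, 2.0], [-1.0, 1.0, 2.0], [1.0, 1.0, 2.0]])
pts_b = pts_a + np.array([0.0, 0.0, 0.5])
normals = np.tile([0.0, 0.0, 1.0], (8, 1))
points = np.vstack([pts_a, pts_b])  # view A occupies rows 0-3, view B rows 4-7

intra = plane_residuals(points, normals, grid_edges(2, 2))          # (a) on view A
cross_edges = knn_edges(pts_a, pts_b, 1, src_offset=0, dst_offset=4)
cross = plane_residuals(points, normals, cross_edges)               # (d) A -> B
ray = ray_residuals(pts_a, np.zeros((4, 3)), pts_a)                 # (c), camera at origin
energy = np.sum(intra**2) + np.sum(cross**2) + np.sum(ray**2)       # equal weights (assumed)
```

In this toy scene the intra-frame and ray terms are zero while each cross-view KNN edge contributes a residual of 0.5, so the energy isolates the inter-view scale error. Minimizing such an energy jointly over points and normals, for example with a nonlinear least-squares solver, would correspond to the refinement stage; that step is omitted here.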