Adaptive Surface Normal Constraint for Geometric Estimation from Monocular Images

Xiaoxiao Long, Yuhang Zheng, Yupeng Zheng, Beiwen Tian, Cheng Lin, Lingjie Liu, Hao Zhao, Guyue Zhou, Wenping Wang

TL;DR

This work addresses monocular depth and surface-normal estimation by introducing an Adaptive Surface Normal (ASN) constraint that enforces depth-normal consistency through a learned geometric context. The method samples local triplets to generate multiple normal candidates, then adaptively weights them using a geometry-aware confidence and area-based factors, while a geometric context-guided normal estimator refines normals in detail-rich regions. A transformer-based network with depth, guidance, and normal decoders learns to predict coherent 3D structure and high-fidelity point clouds across indoor and outdoor datasets, outperforming state-of-the-art methods on depth, normal, and 3D geometry metrics. The approach offers robust, efficient 3D reconstruction from monocular images and highlights the value of explicit geometric context in guiding both depth and normal estimation for practical applications in 3D vision and robotics.

Abstract

We introduce a novel approach to learn geometries such as depth and surface normal from images while incorporating geometric context. The difficulty of reliably capturing geometric context in existing methods impedes their ability to accurately enforce the consistency between the different geometric properties, thereby leading to a bottleneck of geometric estimation quality. We therefore propose the Adaptive Surface Normal (ASN) constraint, a simple yet efficient method. Our approach extracts geometric context that encodes the geometric variations present in the input image and correlates depth estimation with geometric constraints. By dynamically determining reliable local geometry from randomly sampled candidates, we establish a surface normal constraint, where the validity of these candidates is evaluated using the geometric context. Furthermore, our normal estimation leverages the geometric context to prioritize regions that exhibit significant geometric variations, which makes the predicted normals accurately capture intricate and detailed geometric information. Through the integration of geometric context, our method unifies depth and surface normal estimations within a cohesive framework, which enables the generation of high-quality 3D geometry from images. We validate the superiority of our approach over state-of-the-art methods through extensive evaluations and comparisons on diverse indoor and outdoor datasets, showcasing its efficiency and robustness.
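
To make the idea of combining randomly sampled candidates concrete, the following is a minimal sketch for a single pixel neighborhood, not the authors' implementation. It assumes the per-point `confidence` values (similarity of each neighbor to the target pixel under the geometric context) are already given, uses the 3D triangle area as the area-based factor, and orients normals toward a camera looking along +z. The names `adaptive_normal`, `surface`, and `occluder` are hypothetical.

```python
import math
import random


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def length(a):
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)


def adaptive_normal(points, confidence, num_triplets=40, seed=0):
    """Combine normals of randomly sampled point triplets into one normal.

    points:     3D points in a local window (camera coordinates).
    confidence: assumed per-point similarity to the target pixel, in [0, 1].
    Returns a unit normal, or None if no candidate received any weight.
    """
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0]
    for _ in range(num_triplets):
        i, j, k = rng.sample(range(len(points)), 3)
        n = cross(sub(points[j], points[i]), sub(points[k], points[i]))
        n_len = length(n)
        if n_len < 1e-12:
            continue  # collinear triplet: no well-defined normal
        # Orient every candidate toward the camera (negative z).
        sign = -1.0 if n[2] > 0 else 1.0
        # Weight = triangle area * product of the three point confidences.
        area = 0.5 * n_len
        weight = area * confidence[i] * confidence[j] * confidence[k]
        for c in range(3):
            acc[c] += weight * sign * n[c] / n_len
    acc_len = length(acc)
    if acc_len < 1e-12:
        return None
    return tuple(c / acc_len for c in acc)


# Nine points on a fronto-parallel surface plus three points from an occluder.
surface = [(x, y, 2.0) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
occluder = [(1.5, y, 1.0) for y in (-1.0, 0.0, 1.0)]
normal = adaptive_normal(surface + occluder, [1.0] * 9 + [0.0] * 3)
print(normal)
```

In this toy setup the occluder points carry zero confidence, so any triplet that touches them contributes nothing and the result is determined by the surface alone. With uniform confidences, triplets that straddle the two surfaces would also contribute and could bias the estimate.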

Paper Structure

This paper contains 54 sections, 8 equations, 22 figures, and 12 tables.

Figures (22)

  • Figure 1: Taking a monocular RGB image as input, our approach first produces a geometric context map that encodes 3D geometric variations, and then jointly predicts depth and surface normal in a geometry-aware manner. Specifically, we rely on the geometric context to enforce the Adaptive Surface Normal constraint on the predicted depth, which enables the predicted depth to faithfully preserve 3D geometry and thus yields a high-quality point cloud when the depth is converted. Meanwhile, we perform surface normal estimation guided by the geometric context, so that the predicted normal map retains rich geometric details.
  • Figure 2: Sobel-like operator versus ours for surface normal calculation. The Sobel-like operator first calculates two principal vectors along the up-down and left-right directions, and then uses their cross product to estimate the normal. Ours first computes the normal vectors of randomly sampled triplets, and then adaptively combines them to obtain the final result.
  • Figure 3: Overview of our method. Taking a single image as input, our model produces a depth map, geometric context, and a surface normal map from three separate decoders. We recover surface normals from the predicted depth map with our proposed Adaptive Surface Normal (ASN) computation method. The similarity kernels computed from the geometric context make the surface normal calculation aware of local geometry, such as shape boundaries and corners. Furthermore, the geometric context encodes rich geometric variations that the predicted surface normals usually struggle to capture, so we use it to guide surface normal estimation. Finally, pixel-wise depth/normal supervision is enforced on the predicted depth/normal, while geometric supervision is enforced on the recovered surface normal.
  • Figure 4: The structure of our proposed normal-estimation approach. Taking the feature maps produced by the encoder as input, the module first generates initial surface normal predictions using a normal head and then concatenates the encoder feature maps with the surface normals. Next, we use geometric context to guide the sampling of pixels located in regions with rich geometric details. The sampled pixels are fed into an MLP that outputs pixel-wise refined surface normals. Finally, the initial normals of the sampled pixels are replaced with the newly refined normals.
  • Figure 5: Samples from the SVERS and MVS-SYNTH datasets. Regions with depth values larger than 80 m are masked in black for better visualization. Although both datasets are generated from the GTA-V game, SVERS contains diverse driving scenarios with vehicles, motorcycles, and pedestrians in urban, overpass, and countryside environments, while MVS-SYNTH mainly consists of buildings, vehicles, and pedestrians in urban areas with varying viewpoints.
  • ...and 17 more figures
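
For reference, the fixed cross-product baseline described in the Figure 2 caption can be sketched as follows. This is an illustrative simplification, not the paper's code: it uses plain central differences instead of actual Sobel kernels, assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and orients normals toward a camera looking along +z.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points (pinhole model)."""
    return [[((u - cx) * d / fx, (v - cy) * d / fy, d)
             for u, d in enumerate(row)]
            for v, row in enumerate(depth)]


def cross_product_normal(points, v, u):
    """Normal at interior pixel (v, u) from left-right and up-down difference vectors."""
    h, w = len(points), len(points[0])
    if not (0 < v < h - 1 and 0 < u < w - 1):
        raise ValueError("pixel must not lie on the image border")
    horiz = tuple(r - l for r, l in zip(points[v][u + 1], points[v][u - 1]))
    vert = tuple(d - t for d, t in zip(points[v + 1][u], points[v - 1][u]))
    n = (horiz[1] * vert[2] - horiz[2] * vert[1],
         horiz[2] * vert[0] - horiz[0] * vert[2],
         horiz[0] * vert[1] - horiz[1] * vert[0])
    n_len = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
    if n_len < 1e-12:
        return None
    if n[2] > 0:  # flip so the normal faces the camera
        n = (-n[0], -n[1], -n[2])
    return tuple(c / n_len for c in n)


# A flat fronto-parallel surface, and two such surfaces meeting at a depth step.
flat = [[2.0] * 5 for _ in range(5)]
step = [[2.0, 2.0, 2.0, 4.0, 4.0] for _ in range(5)]

flat_normal = cross_product_normal(backproject(flat, 100.0, 100.0, 2.0, 2.0), 2, 2)
step_normal = cross_product_normal(backproject(step, 100.0, 100.0, 2.0, 2.0), 2, 2)
print(flat_normal, step_normal)
```

On the flat map the estimate faces the camera, as expected. At the depth step, both surfaces are still fronto-parallel, yet the horizontal difference vector spans the discontinuity and the estimated normal points almost sideways. This is the kind of boundary case where weighting candidates by local geometry may help.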