StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal

Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Yushuang Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, Xiaoguang Han

TL;DR

StableNormal tackles the image-to-normal problem by mitigating diffusion-based stochasticity to deliver stable and sharp surface normals without ensembling. It introduces a two-stage pipeline: a YOSO initialization with a Shrinkage Regularizer to establish a reliable base, followed by SG-DRN, a semantic-guided refinement that leverages DINO priors for globally consistent detail, plus a DDIM-inspired heuristic sampler. Across indoor benchmarks (DIODE-indoor, iBims, ScannetV2, NYUv2), it outperforms state-of-the-art baselines in stability and achieves competitive accuracy, with clear benefits for downstream tasks such as multi-view and monocular surface reconstruction and normal enhancement. The work demonstrates the practical potential of diffusion-prior-based geometric estimation and provides public code and models to facilitate broader adoption and further research.

Abstract

This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, which conflicts with the deterministic nature of the Image2Normal task, and with a costly ensembling step that slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy: it starts with a one-step normal estimator (YOSO) to derive an initial normal guess that is relatively coarse but reliable, followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance on standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the "stability" and "sharpness" needed for accurate normal estimation. StableNormal represents an early attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been made publicly available at hf.co/Stable-X

Paper Structure

This paper contains 25 sections, 10 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Comparative Analysis of Normal Estimators: "Stability" vs. "Sharpness". One-step GenPercept compromises the high-frequency details and produces overly smooth normals for objects on the table, while GeoWizard produces seemingly sharp normals that are neither correct nor stable. Our method balances stability and sharpness well. The red boxes highlight the visual differences mentioned above.
  • Figure 2: High-variance normal estimations. We show multiple samples for a single scene and visualize the mean and variance of the predicted normals. For each sample, while the normal maps exhibit sharp details, there is high variance in areas with high-frequency content. This high variance in sharp regions makes the inference less reliable.
  • Figure 3: Overview of StableNormal. The overall pipeline is composed of two stages: 1) YOSO aims to produce a confident initialization $x_{t^+}$ for stage two with a novel Shrinkage Regularizer; 2) SG-DRN plays the role of stable denoising, by leveraging the stronger semantic control information extracted from DINO (Oquab et al., 2024). The textual prompt for the U-Net in both stages is set to "The normal map".
  • Figure 4: The comparison of output variance and inference time between our method, GeoWizard, and Marigold. The left plot shows the output variance over ensemble time, while the right plot displays the inference time (including ensembling). It is important to note that our method does not employ the ensemble strategy and only requires a single forward pass.
  • Figure 5: Qualitative Ablation Study. YOSO can produce relatively sharp surface normal estimations with only a single-step sampling; however, its results still lack sufficient details. After refinement by SG-DRN, the predicted surface normals become significantly sharper, as illustrated by the comparison between the third and fourth columns in the figure. This comparison highlights the impact of semantic features on SG-DRN's performance. Specifically, the first row demonstrates how using DINO features assists the network in mitigating the effects of lighting on normal estimation. The second row indicates that DINO features enable effective structural modeling, enhancing the consistency of the normal output. Furthermore, the third row shows that DINO features improve the network's ability to understand materials, e.g., plastic material.
  • ...and 11 more figures
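
The per-pixel mean and variance visualized in Figure 2 can be illustrated with a small sketch. The captions do not state how the variance is computed, so the definition below (the sum of per-component population variances across samples) is an assumption, and the function names `normalize` and `mean_and_variance` are hypothetical. For unit normals this quantity equals 1 - |mean|^2, so it is 0 where all samples agree and grows as they diverge.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length (zero vector stays zero)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else (0.0, 0.0, 0.0)


def mean_and_variance(samples):
    """samples: list of N normal maps, each a flat list of P unit 3-vectors.

    Returns (mean_normals, variance), each with one entry per pixel.
    The variance definition is an assumption, not taken from the paper.
    """
    n = len(samples)
    num_pixels = len(samples[0])
    mean_normals, variance = [], []
    for p in range(num_pixels):
        # Component-wise average of the N sampled normals at this pixel.
        avg = tuple(sum(s[p][k] for s in samples) / n for k in range(3))
        # Sum of per-component population variances across the samples.
        var = sum(
            sum((s[p][k] - avg[k]) ** 2 for s in samples) / n
            for k in range(3)
        )
        mean_normals.append(normalize(avg))  # re-project onto the unit sphere
        variance.append(var)
    return mean_normals, variance


# Toy input: three samples of a two-pixel normal map.
# Pixel 0 is identical in every sample; pixel 1 differs in every sample.
samples = [
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
]
mean_normals, variance = mean_and_variance(samples)
```

In this toy input the first pixel has zero variance, while the second, where the three samples point along different axes, has variance 2/3. A real normal map would more likely be stored as an array and processed with a vectorized library; plain lists are used here only to keep the sketch dependency-free.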