Monocular Normal Estimation via Shading Sequence Estimation

Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai

TL;DR

RoSE reframes monocular normal estimation as shading sequence estimation to overcome 3D misalignment. A video diffusion model predicts a shading sequence under a predefined ring-light path from a single grayscale input, and the normal map is recovered analytically via ordinary least squares as $\mathbf{N} = (\mathbf{L}^T \mathbf{L})^{-1} \mathbf{L}^T \mathbf{S}^s$. Trained on the diverse MultiShade dataset, RoSE achieves state-of-the-art performance on real benchmarks (DiLiGenT, LUCES) and demonstrates robust generalization to unseen objects and materials. The approach yields finer geometric details and better 3D alignment while leveraging strong lighting priors encoded in video diffusion models. This shading-sequence paradigm offers a principled, geometry-sensitive alternative to direct normal-map prediction for monocular normal estimation.
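
The least-squares step above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `ring_lights` and `normals_from_shading`, the number of lights, and the 30-degree polar angle of the ring are assumptions made for illustration, since the summary does not specify the ring-light geometry. The sketch stacks the light directions into $\mathbf{L}$ (one row per light), flattens the shading sequence into $\mathbf{S}^s$ (one row per frame), and solves for $\mathbf{N}$ per pixel.

```python
import numpy as np

def ring_lights(k=8, polar_deg=30.0):
    """Hypothetical ring-light path: k unit directions at a fixed polar angle from +z.

    The actual path used by RoSE is not given in this summary; these values are assumptions.
    """
    az = np.linspace(0.0, 2.0 * np.pi, k, endpoint=False)
    polar = np.deg2rad(polar_deg)
    return np.stack(
        [np.cos(az) * np.sin(polar),
         np.sin(az) * np.sin(polar),
         np.full(k, np.cos(polar))],
        axis=1,
    )  # shape (k, 3)

def normals_from_shading(L, S, eps=1e-8):
    """Recover unit normals from a shading sequence by ordinary least squares.

    L: (k, 3) light directions, S: (k, H, W) shading frames -> (H, W, 3) normals.
    """
    k, H, W = S.shape
    # lstsq minimises ||L N - S||^2, which equals (L^T L)^{-1} L^T S
    # whenever L has full column rank.
    N, *_ = np.linalg.lstsq(L, S.reshape(k, -1), rcond=None)  # (3, H*W)
    N = N.T.reshape(H, W, 3)
    norm = np.linalg.norm(N, axis=-1, keepdims=True)
    return N / np.maximum(norm, eps)

# Synthetic check: render Lambertian shading max(0, n . l) and invert it.
L = ring_lights()
rng = np.random.default_rng(0)
n_true = rng.normal(size=(4, 5, 3)) * 0.2
n_true[..., 2] = 1.0  # keep normals near-frontal so no light is clamped
n_true /= np.linalg.norm(n_true, axis=-1, keepdims=True)
S = np.clip(np.einsum("kc,hwc->khw", L, n_true), 0.0, None)
n_est = normals_from_shading(L, S)
```

In this toy setup every normal faces every light, so the clamp in $\max(0, \cdot)$ never activates and the recovery is exact up to floating-point error. When some frames are in attached shadow the linear model no longer holds for those frames, and the plain least-squares estimate is only approximate.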

Abstract

Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.

Paper Structure (31 sections, 1 theorem, 5 equations, 18 figures, 17 tables)

Key Result

Lemma 1

If a point is considered illuminated when $\max(0, \mathbf{S}) > 0$, then a single parallel light covers at least half of the upper hemisphere. Thus, $n=2$ lights are sufficient to ensure that every point on the sphere is illuminated at least once. In order to guarantee that every point is covered by a

Figures (18)

  • Figure 1: We present RoSE, a method using a video generative model for monocular normal map estimation, built on a new paradigm that reformulates normal estimation as a shading sequence estimation task. Results on complex and diverse scenarios show that RoSE reconstructs fine-grained geometric details and generalizes robustly to unseen datasets, achieving state-of-the-art performance in object-based monocular normal estimation on benchmark datasets.
  • Figure 2: Illustration of 3D misalignment. The normal maps estimated by previous methods may show an overall correct color distribution, yet the reconstructed surfaces often fail to align with the true geometric details and appear over-smoothed. Our estimated normal map achieves better 3D alignment than the others.
  • Figure 3: Validation of sensitivity to geometry variations for different representations, including the proposed shading sequence (left) and the normal map (right), measured by average total variation (TV). TV is computed as the mean magnitude of the first-order image gradient of each representation; a higher TV indicates stronger sensitivity to spatial geometric variation.
  • Figure 4: Ring light setup.
  • Figure 5: Pipeline of RoSE. Given a monocular RGB image under arbitrary lighting, RoSE first converts it to grayscale and then uses a video diffusion model to generate a consistent multi-light shading sequence. The generation is guided by two complementary feature representations, one extracted by a CLIP encoder and one by a VAE encoder. Finally, an analytical solver solves an ordinary least squares problem to estimate the normal map from the generated shading sequence. Only the video diffusion model is trained; the CLIP and VAE encoders are frozen.
  • ...and 13 more figures
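
The average total variation described in the Figure 3 caption can be sketched as follows. The caption only says "mean magnitude of the first-order image gradient", so the forward-difference discretization and the function name `average_total_variation` are assumptions, not the paper's exact definition.

```python
import numpy as np

def average_total_variation(x):
    """Mean gradient magnitude of an array shaped (..., H, W).

    Uses forward differences along width and height (an assumed discretization).
    Leading axes (e.g. frames of a shading sequence or normal-map channels)
    are averaged together with the spatial positions.
    """
    x = np.asarray(x, dtype=np.float64)
    dx = x[..., :-1, 1:] - x[..., :-1, :-1]  # horizontal difference
    dy = x[..., 1:, :-1] - x[..., :-1, :-1]  # vertical difference
    return float(np.mean(np.sqrt(dx ** 2 + dy ** 2)))

# A linear ramp with slope 3 along width and 4 along height has
# gradient magnitude sqrt(3^2 + 4^2) = 5 everywhere.
ramp = 3 * np.arange(5)[None, :] + 4 * np.arange(6)[:, None]
tv_ramp = average_total_variation(ramp)
tv_flat = average_total_variation(np.ones((4, 4)))
```

Under this definition a flat image scores zero and sharper spatial changes score higher, which is the sense in which the caption compares the shading sequence with the normal map.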

Theorems & Definitions (1)

  • Lemma 1