Rethinking Inductive Biases for Surface Normal Estimation

Gwangbin Bae, Andrew J. Davison

TL;DR

This paper discusses the inductive biases needed for surface normal estimation and proposes to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation.

Abstract

Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp, yet piecewise smooth, predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on a dataset that is orders of magnitude smaller. The code is available at https://github.com/baegwangbin/DSINE.
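
As a rough illustration of the per-pixel ray direction mentioned in the abstract, the sketch below back-projects a pixel through a pinhole camera and normalizes the result. It is not taken from the released code: the function names `pixel_ray` and `ray_map`, the intrinsics parameters `fx`, `fy`, `cx`, `cy`, and the axis convention (x right, y down, z forward) are assumptions made for this example, and plain Python is used in place of batched tensor operations.

```python
import math


def pixel_ray(u, v, fx, fy, cx, cy):
    """Return the unit ray direction through pixel (u, v) of a pinhole camera.

    Assumed convention (not from the paper): x right, y down, z forward.
    """
    # Focal length-normalized image coordinates.
    x = (u - cx) / fx
    y = (v - cy) / fy
    # Back-project onto the z = 1 plane, then scale to unit length.
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def ray_map(width, height, fx, fy, cx, cy):
    """Per-pixel rays for a width x height image, indexed as rays[v][u]."""
    return [[pixel_ray(u, v, fx, fy, cx, cy) for u in range(width)]
            for v in range(height)]
```

Because the rays depend only on the intrinsics and the pixel grid, a map like this can be computed for any resolution or aspect ratio and supplied to a network alongside the image.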

Paper Structure (21 sections, 5 equations, 8 figures, 4 tables)

This paper contains 21 sections, 5 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Examples of challenging in-the-wild images and their surface normals predicted by our method.
  • Figure 2: Motivation. In this paper, we propose to utilize the per-pixel ray direction and estimate the surface normals by learning the relative rotation between nearby pixels. (a) Ray direction serves as a useful cue for pixels near occluding boundaries as the normal should be perpendicular to the ray. (b) It also gives us the range of normals that would be visible, effectively halving the output space. (c) The surface normals of certain scene elements (in this case, the floor) may be difficult to estimate due to the lack of visual cues. Nonetheless, we can infer their normals by learning the pairwise relationship between nearby normals (e.g., which surfaces should be perpendicular). (d) Modeling the relative change in surface normals is not just useful for flat surfaces. In this example, the relative angle between the normals of the yellow pixels can be inferred from that of the red pixels assuming circular symmetry.
  • Figure 3: Encoding camera intrinsics. (left) To avoid having to learn camera intrinsics-aware prediction, one can zero-pad or crop the images such that they always have the same intrinsics. (right) Instead, we compute the focal length-normalized image coordinates and provide them as additional input to the network.
  • Figure 4: Ray ReLU activation. An important constraint for surface normal estimation is that the predicted normal should be visible. We achieve this by zeroing out the component that is in the direction of the ray.
  • Figure 5: Network architecture. A lightweight CNN extracts a low-resolution feature map, from which the initial normal, hidden state and context feature are obtained. The hidden state is then recurrently updated using a ConvGRU unit. From the updated hidden state, we estimate three quantities: a rotation angle and a rotation axis, which together define a pairwise rotation matrix for each neighboring pixel, and a set of weights that will be used to fuse the rotated normals.
  • ...and 3 more figures
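
The captions for Figures 4 and 5 describe two per-pixel operations that can be sketched in a few lines: zeroing out the normal component along the ray, and rotating neighboring normals by an axis and angle before fusing them with weights. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The names `ray_relu`, `rotate`, and `fuse_neighbors` are hypothetical, the ray is assumed to point from the camera into the scene (so a visible normal satisfies n · r ≤ 0), and the rotation uses the standard Rodrigues formula as one way to realize an axis-angle rotation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v, eps=1e-12):
    """Scale v to unit length; near-zero vectors are returned unchanged."""
    n = math.sqrt(dot(v, v))
    if n < eps:
        return tuple(v)
    return tuple(x / n for x in v)


def ray_relu(normal, ray):
    """Remove the part of `normal` that points along the unit `ray`.

    Assumption: the ray points away from the camera, so normals with a
    positive dot product would face away from the viewer.
    """
    d = dot(normal, ray)
    if d > 0.0:
        normal = tuple(n - d * r for n, r in zip(normal, ray))
    return normalize(normal)


def rotate(v, axis, angle):
    """Rotate v about `axis` by `angle` radians (Rodrigues' formula)."""
    k = normalize(axis)
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(k, v)
    kdv = dot(k, v)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c)
                 for i in range(3))


def fuse_neighbors(normals, axes, angles, weights):
    """Weighted sum of rotated neighbor normals, renormalized to unit length."""
    total = [0.0, 0.0, 0.0]
    for n, axis, angle, w in zip(normals, axes, angles, weights):
        rotated = rotate(n, axis, angle)
        for i in range(3):
            total[i] += w * rotated[i]
    return normalize(total)
```

In this sketch a normal already facing the camera passes through `ray_relu` unchanged, while one facing away is projected onto the plane perpendicular to the ray, which matches the occluding-boundary case described in Figure 2(a).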