Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation

Li Liu, Ruijie Zhu, Jiacheng Deng, Ziyang Song, Wenfei Yang, Tianzhu Zhang

TL;DR

This work tackles monocular depth estimation by introducing Plane2Depth, a plane-guided hierarchical framework that leverages plane priors through plane queries. It combines a plane guided depth generator with an adaptive plane query aggregation module to produce per-pixel plane bases and soft assignments, which are converted to metric depth via the pinhole camera model. The approach achieves state-of-the-art results on NYU-Depth-v2, competitive performance on KITTI, and strong zero-shot generalization to SUN RGB-D, while maintaining efficiency through adaptive feature modulation. Overall, the method robustly models planes to improve depth in low-texture and repetitive regions without sacrificing non-planar region performance, advancing practical monocular depth estimation.

Abstract

Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. The predicted plane coefficients can then be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APQA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method can achieve outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods on the KITTI dataset, and can be generalized to unseen scenes effectively.
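
The conversion from plane coefficients to metric depth described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common point-normal convention in which a pixel with homogeneous coordinates $\tilde{p}$, unit normal $N(p)$, and plane distance $T(p)$ has depth $Z = T(p) / (N(p)^\top K^{-1} \tilde{p})$, where $K$ is the camera intrinsic matrix. The function names `soft_plane_coefficients` and `plane_to_depth`, the array shapes, and the way soft assignments are combined with plane bases (a softmax-weighted sum) are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_plane_coefficients(logits, normal_bases, distance_bases, eps=1e-8):
    """Blend Q plane bases into per-pixel coefficients (assumed scheme).

    logits: (H, W, Q) per-pixel scores over Q plane queries.
    normal_bases: (Q, 3) one normal per query; distance_bases: (Q,).
    """
    # Softmax over the query axis gives a soft assignment per pixel.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # Weighted sum of bases, then renormalize normals to unit length.
    normals = probs @ normal_bases                      # (H, W, 3)
    normals = normals / (np.linalg.norm(normals, axis=-1, keepdims=True) + eps)
    distances = probs @ distance_bases                  # (H, W)
    return normals, distances


def plane_to_depth(normals, distances, K, eps=1e-6):
    """Depth from per-pixel planes under a pinhole camera.

    Solves N^T (Z * K^{-1} p) = T for Z at every pixel.
    normals: (H, W, 3), distances: (H, W), K: (3, 3) intrinsics.
    """
    H, W = distances.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))      # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                     # K^{-1} p per pixel
    denom = np.sum(normals * rays, axis=-1)
    # Guard against rays nearly parallel to the plane (sketch-level handling).
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return distances / denom
```

For a fronto-parallel plane (normal along the optical axis) the denominator is 1 at every pixel, so the depth map equals the plane distance. For tilted planes, the depth varies smoothly across pixels that share one normal, which is the situation the plane prior is meant to capture.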

Paper Structure

This paper contains 20 sections, 15 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Illustration of our motivation. We present an example of "visual deception" in (a). The significant color discrepancies mislead the network into predicting an incorrect depth map. Our method successfully mitigates this issue by using plane information. In (b), pixels within the yellow bounding box correspond to different depths but share the same surface normal (indicated by identical colors representing the corresponding values). Since the generation of ground truth for surface normal depends on conventional algorithms, there exists a slight offset.
  • Figure 2: The overall architecture of Plane2Depth. We use a set of plane queries to predict plane coefficients through E-MLP, N-MLP, and T-MLP, respectively. Then the predicted plane coefficients are converted to metric depth maps through the pinhole camera model. For consistent query prediction, we adopt the APQA module to aggregate multi-scale image features and adaptively modulate them via AF modulators.
  • Figure 3: The visualization of query activation maps between plane queries and image features. Top: The query activation maps in Mask2Former [cheng2022masked]. Bottom: Our query activation maps. The red regions indicate high correlation, while the blue regions indicate low correlation. Our plane queries can adaptively aggregate plane features in the image and predict plane bases in the scene. Each plane query focuses on distinct plane regions in the scene.
  • Figure 4: Illustration of the point-normal form equation. $o$ represents the camera center, $N(p)$ represents the surface normal vector starting from $X$, $X'$ can be an arbitrary point on the plane, and $T(p)$ denotes the distance from the camera to the plane.
  • Figure 5: Qualitative results on the NYU-Depth-v2 dataset. Each column corresponds to a method. The predicted depth map is on the left and the error map is on the right. We apply the coolwarm colormap to visualize the error map, with values clipped at 0.5. Blue indicates low error, while red indicates high error. In rows 1-3, our predicted maps show a clear advantage in repetitive regions. Similarly, in rows 4-6, our method effectively handles areas with weak texture.
  • ...and 6 more figures