Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation
Li Liu, Ruijie Zhu, Jiacheng Deng, Ziyang Song, Wenfei Yang, Tianzhu Zhang
TL;DR
This work tackles monocular depth estimation by introducing Plane2Depth, a plane-guided hierarchical framework that leverages plane priors through plane queries. It combines a plane guided depth generator with an adaptive plane query aggregation module to produce per-pixel plane bases and soft assignments, which are converted to metric depth via the pinhole camera model. The approach achieves state-of-the-art results on NYU-Depth-v2, competitive performance on KITTI, and strong zero-shot generalization to SUN RGB-D, while maintaining efficiency through adaptive feature modulation. Overall, explicitly modeling planes improves depth in low-texture and repetitive regions without degrading performance in non-planar regions.
Abstract
Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. The predicted plane coefficients are then converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method achieves outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves results competitive with state-of-the-art methods on the KITTI dataset, and generalizes effectively to unseen scenes.
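The conversion from plane coefficients to metric depth can be illustrated with a minimal sketch. It assumes one common parameterization, which the abstract does not specify: each pixel carries a plane normal n and an offset d such that n·X = d for 3D points X on the plane in the camera frame. Back-projecting pixel (u, v) with the pinhole model gives X = Z · K⁻¹[u, v, 1]ᵀ, so Z = d / (n · K⁻¹[u, v, 1]ᵀ). The function name `plane_to_depth`, the array shapes, and the intrinsics used below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def plane_to_depth(normals, offsets, K, eps=1e-6):
    """Convert per-pixel plane coefficients to depth (illustrative sketch).

    Assumed parameterization (not taken from the paper): n . X = d.
    normals: (H, W, 3) plane normals in the camera frame
    offsets: (H, W) plane offsets d
    K:       (3, 3) pinhole intrinsic matrix
    Returns an (H, W) array of depths Z along the optical axis.
    """
    H, W = offsets.shape
    # Pixel grid in homogeneous coordinates [u, v, 1].
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-projected rays K^{-1} [u, v, 1]^T for every pixel, shape (H, W, 3).
    rays = pixels @ np.linalg.inv(K).T
    # Per-pixel dot product n . ray.
    denom = np.sum(normals * rays, axis=-1)
    # Replace near-zero denominators (ray almost parallel to the plane)
    # with eps to avoid division by zero.
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return offsets / denom

# Usage: a fronto-parallel wall 2 m in front of the camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])  # hypothetical intrinsics
normals = np.zeros((480, 640, 3))
normals[..., 2] = 1.0            # plane normal along the optical axis
offsets = np.full((480, 640), 2.0)
depth = plane_to_depth(normals, offsets, K)
```

For the fronto-parallel wall in the usage example, every ray has a z-component of 1, so the recovered depth is approximately 2 at every pixel. For a tilted plane the depth varies smoothly across the image, which is the behavior a plane prior is meant to enforce in low-texture regions.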
