Scalable Autoregressive Monocular Depth Estimation
Jinhong Wang, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Danny Chen, Jintai Chen, Jian Wu
TL;DR
This work addresses monocular depth estimation by reframing depth prediction as two autoregressive processes: (i) a resolution autoregression that generates depth maps from low to high spatial resolution, and (ii) a granularity autoregression that refines depth values from coarse to fine bins. The Depth AutoRegressive (DAR) model is built on a Transformer (the DAR Transformer) with a patch-wise causal mask, and introduces the Multiway Tree Bins (MTBin) strategy plus a Bin Injection module to integrate progressively finer depth candidates. DAR achieves state-of-the-art results on KITTI and NYU Depth v2, scales up to 2.0B parameters, and demonstrates zero-shot generalization to unseen datasets, suggesting a scalable path for AR-based depth estimation and for integration with autoregressive foundation models. These findings indicate that autoregressive paradigms can handle dense depth prediction while preserving the generalization and scalability advantages typical of AR models in vision-language and multi-modal settings.
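To make the patch-wise causal mask concrete, the following is a minimal sketch, not the authors' implementation. It assumes the token sequence is the concatenation of flattened square depth maps, ordered from the lowest to the highest resolution, and that a token may attend to every token at its own resolution or any lower one. The function name `patchwise_causal_mask` and the `side_lengths` parameter are hypothetical.

```python
def patchwise_causal_mask(side_lengths):
    """Build a block-wise causal attention mask (illustrative sketch only).

    side_lengths: side length of each square token map, ordered from low to
    high resolution, e.g. [1, 2, 4] gives 1 + 4 + 16 = 21 tokens.
    Returns a list of lists of bools where mask[q][k] is True when query
    token q is allowed to attend to key token k.
    """
    # Number of tokens contributed by each resolution level.
    counts = [s * s for s in side_lengths]
    # Resolution-level index of every token in the flattened sequence.
    scale_of = [level for level, n in enumerate(counts) for _ in range(n)]
    total = len(scale_of)
    # Assumption: attention is allowed within a level and to all coarser levels.
    return [
        [scale_of[k] <= scale_of[q] for k in range(total)]
        for q in range(total)
    ]


mask = patchwise_causal_mask([1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With two levels of side lengths 1 and 2, the single coarse token sees only itself, while each of the four finer tokens sees all five tokens. Unlike a token-wise causal mask, tokens within the same resolution are not ordered relative to one another.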
Abstract
This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: we tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats depth maps at different resolutions as a set of tokens, and applies a low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, DAR recursively discretizes the entire depth range into more compact intervals, and pursues a coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, DAR establishes a new state of the art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our approach allows us to scale the model up to 2.0B parameters and achieve the best RMSE of 1.799 on the KITTI dataset, a 5% improvement over the 1.896 of the current SOTA (Depth Anything). DAR also shows zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
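The recursive discretization of the depth range can be illustrated with a small sketch. It rests on several assumptions that the abstract does not state: each interval is split uniformly, the branching factor is the same at every level, and the bin is selected from a known depth value rather than predicted by the model. The name `refine_bins`, its parameters, and the 0 to 80 range in the example are hypothetical choices for illustration only.

```python
def refine_bins(depth, d_min, d_max, branching=4, levels=3):
    """Coarse-to-fine interval refinement (illustrative sketch only).

    At each level the current interval is split into `branching` equal
    sub-intervals and the one containing `depth` becomes the new interval.
    Returns the list of chosen bin indices and the final (lo, hi) interval.
    """
    lo, hi = d_min, d_max
    path = []
    for _ in range(levels):
        width = (hi - lo) / branching
        # Index of the sub-interval containing `depth`, clamped to a valid bin
        # so that values on or beyond the boundaries stay in range.
        idx = int((depth - lo) / width)
        idx = max(0, min(idx, branching - 1))
        path.append(idx)
        lo, hi = lo + idx * width, lo + (idx + 1) * width
    return path, (lo, hi)


path, interval = refine_bins(10.0, 0.0, 80.0)
print(path, interval)  # [0, 2, 0] (10.0, 11.25)
```

In this toy setting, three levels with four branches each narrow an 80-unit range to an interval 1.25 units wide, the resolution of 64 flat bins, while each step only chooses among four candidates.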
