Scalable Autoregressive Monocular Depth Estimation

Jinhong Wang, Jian Liu, Dongqi Tang, Weiqiang Wang, Wentong Li, Danny Chen, Jintai Chen, Jian Wu

TL;DR

This work addresses monocular depth estimation by reframing depth prediction as two autoregressive processes: (i) a resolution autoregression that generates depth maps from low to high spatial resolution, and (ii) a granularity autoregression that refines depth values from coarse to fine bins. The Depth AutoRegressive (DAR) model uses a DAR Transformer with a patch-wise causal mask, and introduces the Multiway Tree Bins (MTBin) strategy plus a Bin Injection module to integrate progressively finer depth candidates. DAR achieves state-of-the-art results on KITTI and NYU Depth v2, scales up to 2.0B parameters, and demonstrates zero-shot generalization to unseen datasets, suggesting a scalable path for AR-based depth estimation and integration with autoregressive foundation models. These findings indicate that autoregressive paradigms can handle dense depth prediction while preserving the generalization and scalability advantages typical of AR models in vision-language and multi-modal settings.

Abstract

This paper shows that the autoregressive model is an effective and scalable monocular depth estimator. Our idea is simple: we tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats depth maps of different resolutions as a set of tokens, and conducts the low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, our DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, our DAR establishes new state-of-the-art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B parameters and achieve the best RMSE of 1.799 on the KITTI dataset (5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
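
The patch-wise causal mask mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `patchwise_causal_mask` and the token-map sizes are hypothetical, and the global image feature tokens that DAR also uses as context are left out. The sketch only assumes that every token may attend to tokens of its own token map and of all lower-resolution (prefix) token maps.

```python
def patchwise_causal_mask(side_lengths):
    """Build a block-wise causal attention mask (illustrative sketch).

    side_lengths: side length of each square token map, ordered from low
    to high resolution (hypothetical values, not taken from the paper).
    Returns a list of lists where mask[q][k] is True if query token q is
    allowed to attend to key token k.
    """
    # Record which resolution step each flattened token belongs to.
    scale_of = []
    for scale, side in enumerate(side_lengths):
        scale_of.extend([scale] * (side * side))

    n = len(scale_of)
    # A token sees its own token map and every earlier (coarser) one.
    return [[scale_of[k] <= scale_of[q] for k in range(n)] for q in range(n)]


# Three token maps of 1x1, 2x2 and 4x4 tokens: 1 + 4 + 16 = 21 tokens.
mask = patchwise_causal_mask([1, 2, 4])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Unlike a token-wise causal mask, the allowed region here is a staircase of blocks: attention is unrestricted within a token map and blocked only toward higher-resolution maps.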

Paper Structure

This paper contains 15 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: We exploit two "order" properties of the MDE task that can be transformed into two autoregressive objectives. (a) Resolution autoregressive objective: The generation of depth maps can follow a resolution order from low to high. For each step of the resolution autoregressive process, the Transformer predicts the next higher-resolution token map conditioned on all the previous ones. (b) Granularity autoregressive objective: The range of depth values is ordered, from 0 to a specific maximum value. For each step of the granularity autoregressive process, we exponentially increase the number of bins (e.g., doubling the bin number), and use the previous predictions to predict a more refined depth at a finer granularity. Our proposed DAR aims to perform these two autoregressive processes simultaneously.
  • Figure 2: RMSE performance ($\downarrow$) vs. model size on the KITTI dataset. Our DAR shows strong scalability and achieves a better performance-efficiency trade-off than other cutting-edge methods.
  • Figure 3: An overview of DAR. We begin with encoding the input RGB images into image tokens as the context condition. At each step, DAR Transformer with the patch-wise causal mask performs autoregressive predictions, that is, it allows the input token map (upsampled from the previous resolution token map $r_{k-1}$) to interact with only the prefix tokens and global image feature tokens for the next-resolution token map modeling. The output latent tokens are then sent to the ConvGRU module, which injects the prompts of new refined bin candidates $\mathbf{c}_{k}$ (generated by MTBin from the previous prediction $\tilde{D}_{k-1}$) for further granularity guidance and generates the next-resolution token map $r_{k}$. The new depth map $\tilde{D}_{k}$ is generated by a linear combination of the next-granularity bin candidates $\mathbf{c}_{k}$ and softmax value $\mathbf{p}_{k}$ of the next-resolution token map, achieving concurrently a resolution and granularity autoregressive evolution.
  • Figure 4: Illustration of the patch-wise causal mask, which ensures that the current token map can interact only with tokens from itself and the prefix token maps.
  • Figure 5: A schematic diagram of the multiway tree bins strategy.
  • ...and 2 more figures
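
The Figure 1 and Figure 3 captions describe a depth estimate formed as a linear combination of bin candidates $\mathbf{c}_{k}$ weighted by softmax values $\mathbf{p}_{k}$, with the candidates becoming finer at every step. The following single-pixel sketch illustrates that idea under loudly stated assumptions: the refinement rule (split the bin containing the previous estimate into equal sub-bins), the branching factor of 2, the 0 to 80 depth range, and the logits are all made up for illustration, and the helper names are hypothetical. The actual MTBin rule and the ConvGRU-based Bin Injection module are not specified in this section and are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(lo, hi, n):
    """Centers of n equal-width bins covering [lo, hi]."""
    width = (hi - lo) / n
    return [lo + (i + 0.5) * width for i in range(n)]


def expected_depth(logits, candidates):
    """Depth as a probability-weighted sum of bin candidates."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, candidates))


def narrow_interval(lo, hi, estimate, branching):
    """Assumed refinement rule: keep the sub-interval containing the estimate."""
    width = (hi - lo) / branching
    index = int((estimate - lo) / width)
    index = max(0, min(branching - 1, index))  # clamp to a valid bin
    return lo + index * width, lo + (index + 1) * width


# Illustrative setup: one pixel, three refinement steps, made-up logits.
lo, hi = 0.0, 80.0
branching = 2
per_step_logits = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

estimates = []
for logits in per_step_logits:
    candidates = bin_centers(lo, hi, branching)     # finer candidates each step
    depth = expected_depth(logits, candidates)      # linear combination
    estimates.append(depth)
    lo, hi = narrow_interval(lo, hi, depth, branching)

print(estimates, (lo, hi))
```

With a branching factor of 2, the search interval halves at every step, which matches the caption's example of doubling the effective number of bins over the full depth range.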