Fractal Autoregressive Depth Estimation with Continuous Token Diffusion

Jinchang Zhang, Xinrou Kang, Guoyu Lu

Abstract

Monocular depth estimation can benefit from autoregressive (AR) generation, but direct AR modeling is hindered by the modality gap between RGB and depth, inefficient pixel-wise generation, and instability in continuous depth prediction. We propose a Fractal Visual Autoregressive Diffusion framework that reformulates depth estimation as a coarse-to-fine, next-scale autoregressive generation process. A VCFR module fuses multi-scale image features with current depth predictions to improve cross-modal conditioning, while a conditional denoising diffusion loss models depth distributions directly in continuous space and mitigates errors caused by discrete quantization. To improve computational efficiency, we organize the scale-wise generators into a fractal recursive architecture, reusing a base visual AR unit in a self-similar hierarchy. We further introduce an uncertainty-aware robust consensus aggregation scheme for multi-sample inference to improve fusion stability and provide a practical pixel-wise reliability estimate. Experiments on standard benchmarks demonstrate strong performance and validate the effectiveness of the proposed design.
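The abstract does not spell out the uncertainty-aware robust consensus aggregation, so the following is only a minimal sketch of one plausible instance. It assumes a pixel-wise median as the robust consensus over $N$ sampled depth maps and a scaled median absolute deviation (MAD), normalized by the consensus depth, as the pixel-wise reliability proxy. The function name `robust_consensus` and both choices are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def robust_consensus(samples, eps=1e-6):
    """Fuse N depth samples of shape (N, H, W) into one map plus an uncertainty proxy.

    Assumption (not from the paper): the consensus is the pixel-wise median and
    the uncertainty proxy is the scaled MAD relative to the consensus depth.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # The median is robust: a minority of outlier samples cannot move it.
    consensus = np.median(samples, axis=0)
    # Median absolute deviation around the consensus, per pixel.
    mad = np.median(np.abs(samples - consensus), axis=0)
    # 1.4826 makes the MAD comparable to a standard deviation under Gaussian noise.
    uncertainty = 1.4826 * mad / (np.abs(consensus) + eps)
    return consensus, uncertainty
```

With four samples of which one is a gross outlier, the median ignores the outlier entirely, whereas a plain mean would be pulled toward it. Pixels whose proxy exceeds a chosen threshold could then be flagged as unreliable.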

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Given an input RGB image $X \in \mathbb{R}^{H \times W \times 3}$, we first extract multi-scale visual features $Z(X)$ using an image encoder and a multi-scale aggregation module; these features serve as global conditioning throughout depth generation. The framework is organized as a fractal recursive hierarchy of four nested scale-wise generators $\{g_4, g_3, g_2, g_1\}$, which progressively recover depth representations from coarse to fine and output the final depth map $\hat{D} \in \mathbb{R}^{H \times W}$. Specifically, $g_4$ starts from a coarse latent token and predicts an initial depth latent, which is passed to $g_3$, then to $g_2$, and finally to $g_1$. The top-left panel illustrates the fractal recursion pattern, while the bottom panel shows the concrete generation pipeline. At each scale, a VCFR block fuses image features with the current-scale depth latent/token to form a Visual-Depth Joint token for next-scale prediction. Within each module, depth is predicted via a visual autoregressive diffusion process, in which a conditional denoising diffusion objective models the continuous depth distribution at the current scale; the noise predictor $\epsilon_{\theta}(x_t \mid t, z)$ takes the diffusion timestep $t$ and the condition vector $z$ as inputs and is optimized with a diffusion loss.
  • Figure 2: Depth predictions at different stages of the fractal framework.
  • Figure 3: Visual results on NYU. From left to right: input images, ground-truth depth, DiffusionDepth [duan2024diffusiondepth], Repurposing Diffusion [ke2023repurposing], and ours.
  • Figure 4: Uncertainty distribution shifts under multi-sample inference with different numbers of runs ($N=1,2,4,8$). In each subplot, the x-axis denotes the pixel-wise uncertainty proxy $u$ and the y-axis denotes density. The dashed vertical line indicates the threshold $u=1.0$.
  • Figure 5: Effect of the number of inference runs $N$ on depth estimation performance on KITTI [geiger2013vision].
  • ...and 1 more figure
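
The Figure 1 caption refers to a noise predictor $\epsilon_{\theta}(x_t \mid t, z)$ trained with a diffusion loss. As a hedged illustration, the sketch below computes the standard DDPM-style noise-prediction objective for continuous tokens. The page does not give the paper's exact loss, noise schedule, or network, so `diffusion_loss`, the `eps_model` callable, and the `alphas_cumprod` schedule are all assumptions made for this example.

```python
import numpy as np

def diffusion_loss(x0, z, eps_model, alphas_cumprod, rng):
    """One Monte Carlo estimate of the noise-prediction loss.

    x0:             clean continuous depth tokens, shape (B, D)
    z:              condition vectors from the AR model, shape (B, C)
    eps_model:      hypothetical callable (x_t, t, z) -> predicted noise, shape (B, D)
    alphas_cumprod: cumulative products of (1 - beta_t), shape (T,), an assumed schedule
    rng:            a numpy.random.Generator
    """
    T = len(alphas_cumprod)
    t = rng.integers(0, T, size=x0.shape[0])       # one random timestep per token
    a_bar = alphas_cumprod[t][:, None]             # broadcast over the token dimension
    eps = rng.standard_normal(x0.shape)            # target Gaussian noise
    # Forward noising: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    pred = eps_model(x_t, t, z)
    # Mean squared error between the true and the predicted noise.
    return np.mean((eps - pred) ** 2)
```

Because the target is continuous noise rather than a class index, no depth quantization is involved, which is the property the abstract attributes to modeling depth directly in continuous space.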