From Editor to Dense Geometry Estimator

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu

Abstract

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining'' their innate features and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneers the adaptation of an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into a ``consistent velocity'' training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demands of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ more data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.
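
As a rough illustration of why a logarithmic mapping can help when depth must pass through a low-precision representation, the sketch below compares a linear and a logarithmic encoding under the same budget of quantization levels. It is not the paper's formulation: the depth range (`D_MIN`, `D_MAX`), the 256-level uniform quantizer standing in for limited precision, and all function names are assumptions made for this example.

```python
import numpy as np

# Hypothetical depth range in metres; not taken from the paper.
D_MIN, D_MAX = 0.1, 100.0


def quantize_uniform(code, levels=256):
    """Round codes in [0, 1] to the nearest of `levels` evenly spaced values.

    This is a simple stand-in for a limited-precision representation.
    """
    return np.round(code * (levels - 1)) / (levels - 1)


def encode_linear(depth, d_min, d_max):
    """Map depth linearly to [0, 1]."""
    return (depth - d_min) / (d_max - d_min)


def decode_linear(code, d_min, d_max):
    return d_min + code * (d_max - d_min)


def encode_log(depth, d_min, d_max):
    """Map depth to [0, 1] so that equal code steps are equal depth ratios."""
    return np.log(depth / d_min) / np.log(d_max / d_min)


def decode_log(code, d_min, d_max):
    return d_min * np.exp(code * np.log(d_max / d_min))


depth = np.array([0.3, 1.0, 5.0, 20.0, 80.0])

linear_recon = decode_linear(
    quantize_uniform(encode_linear(depth, D_MIN, D_MAX)), D_MIN, D_MAX
)
log_recon = decode_log(
    quantize_uniform(encode_log(depth, D_MIN, D_MAX)), D_MIN, D_MAX
)

# Relative error per sample after the quantize/decode round trip.
rel_err_linear = np.abs(linear_recon - depth) / depth
rel_err_log = np.abs(log_recon - depth) / depth

print("linear:", rel_err_linear)
print("log:   ", rel_err_log)
```

Under these assumptions the logarithmic encoding keeps the relative error bounded uniformly across the range, whereas the linear encoding spends most of its levels on far depths and loses precision on near ones.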

Paper Structure

This paper contains 36 sections, 19 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: We present FE2E, a DiT-based foundation model for monocular dense geometry prediction. Trained with limited supervision, FE2E achieves promising performance improvements in zero-shot depth and normal estimation. Bar length indicates the average ranking across all metrics from multiple datasets, where lower values are better. ★ represents the amount of training data used.
  • Figure 2: FE2E Adaptation Pipeline. The grey background shows the original editor's workflow, while the rest details FE2E: ① A pre-trained VAE encodes the logarithmically quantized depth $\mathbf{d}$, input image $\mathbf{x}$, and normals $\mathbf{n}$ into latent space. ② The DiT $f_\theta$ learns a constant velocity $\mathbf{v}$ from a fixed origin $\mathbf{z}^y_0$ to the target latent $\mathbf{z}^y_1$, independent of $t$ or instructions. ③ By repurposing the discarded output region, FE2E jointly predicts depth and normals without extra computation. The training loss is computed in the latent space, and final predictions are decoded by the VAE only at inference.
  • Figure 3: Comparison between the Generative and Editing foundation models. We analyze the feature evolution at both the initial (Epoch 1) and final (Epoch 30) stages of fine-tuning, resulting in 4 groups. Each group presents the DiT features at the input end (Block 1), middle layers (Block 20), and output end (Block 35), together with the AbsRel (Absolute Relative error) of the depth prediction. Visualization details are given in Sec. B.
  • Figure 4: Quantitative comparison of the training loss between Generative and Editing foundation models. The main plot details the convergence loss from epoch 5 to 30, while the inset displays the steep initial loss reduction during the first 10 epochs, which occurs on a different scale.
  • Figure 5: Left: GT velocity field for network training. The gray dots represent different Gaussian noise samples (top) or a zero starting point (bottom); the red dots represent data samples. Right: The instantaneous velocity $v$ determines the tangent direction and introduces errors into the cumulative path (top); the constant-velocity path is a straight line.
  • ...and 5 more figures
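
The captions of Figures 2 and 5 describe a constant velocity from a fixed origin to the target latent, contrasted with an instantaneous velocity that accumulates error along the path. The minimal sketch below illustrates that contrast with plain Euler integration. It is an illustration under stated assumptions rather than the authors' implementation: the latent shape, the zero origin `z0`, the function `euler_integrate`, and the time-dependent field `2 * t * (z1 - z0)` are all hypothetical choices for this example.

```python
import numpy as np


def euler_integrate(z0, velocity_fn, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with Euler steps."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z


rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 8))  # hypothetical target latent
z0 = np.zeros_like(z1)            # fixed origin (assumed to be zero here)


def constant_velocity(z, t):
    """Same velocity everywhere along the path, independent of t."""
    return z1 - z0


def time_varying_velocity(z, t):
    """Hypothetical t-dependent field whose exact integral also reaches z1."""
    return 2.0 * t * (z1 - z0)


# With a constant velocity, one Euler step already lands on the target.
const_one_step = euler_integrate(z0, constant_velocity, 1)
const_many_steps = euler_integrate(z0, constant_velocity, 50)

# With a time-varying velocity, coarse integration misses the target.
varying_one_step = euler_integrate(z0, time_varying_velocity, 1)
varying_many_steps = euler_integrate(z0, time_varying_velocity, 50)

print("constant, 1 step  :", np.abs(const_one_step - z1).max())
print("constant, 50 steps:", np.abs(const_many_steps - z1).max())
print("varying,  1 step  :", np.abs(varying_one_step - z1).max())
print("varying,  50 steps:", np.abs(varying_many_steps - z1).max())
```

In this toy setting the constant-velocity path is recovered exactly regardless of the step count, while the time-varying field needs many steps to approach the same endpoint, which matches the straight-line versus curved-path picture in the Figure 5 caption.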