Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang

Abstract

In this paper, we propose Iris, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. Conventional feed-forward methods rely on massive training data, yet still miss details. Previous diffusion-based methods leverage rich generative priors yet struggle with synthetic-to-real domain transfer. Iris, in contrast, preserves fine details, generalizes strongly from synthetic to real scenes, and remains efficient with limited training data. To this end, we introduce a two-stage Priors-to-Geometry Deterministic (PGD) schedule: the prior stage uses Spectral-Gated Distillation (SGD) to transfer low-frequency real priors while leaving high-frequency details unconstrained, and the geometry stage applies Spectral-Gated Consistency (SGC) to enforce high-frequency fidelity while refining with synthetic ground truth. The two stages share weights and are executed with a high-to-low timestep schedule. Extensive experimental results confirm that Iris achieves significant improvements in MDE performance with strong in-the-wild generalization.
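
To make the two spectral gates concrete, the following is a minimal sketch of what a low-pass distillation term and a high-pass consistency term could look like. It is not the authors' implementation: the hard radial mask, the `cutoff` value, the mean-squared-error form, and all function names (`low_pass_gate`, `sgd_loss`, `sgc_loss`) are assumptions made for illustration, and the paper's actual gates and losses may differ.

```python
import numpy as np

def radial_frequency(h, w):
    """Radial frequency grid in FFT layout (0 at DC, about 0.707 at the corners)."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)

def low_pass_gate(h, w, cutoff=0.1):
    """Hypothetical hard gate: 1 inside the cutoff radius, 0 outside."""
    return (radial_frequency(h, w) <= cutoff).astype(np.float64)

def apply_gate(x, gate):
    """Filter a 2D map by multiplying its spectrum with the gate."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * gate))

def sgd_loss(student, teacher, cutoff=0.1):
    """Assumed SGD-style term: match teacher and student only at low frequencies."""
    gate = low_pass_gate(*student.shape, cutoff)
    diff = apply_gate(student - teacher, gate)
    return float(np.mean(diff ** 2))

def sgc_loss(stage2, stage1, cutoff=0.1):
    """Assumed SGC-style term: match the two stages only at high frequencies."""
    gate = 1.0 - low_pass_gate(*stage2.shape, cutoff)
    diff = apply_gate(stage2 - stage1, gate)
    return float(np.mean(diff ** 2))

# Toy check: a checkerboard perturbation lives entirely at the highest frequency,
# so the low-pass term ignores it while the high-pass term penalizes it.
rng = np.random.default_rng(0)
teacher = rng.random((32, 32))
i, j = np.indices(teacher.shape)
checker = 0.5 * (-1.0) ** (i + j)
print(sgd_loss(teacher + checker, teacher))  # close to 0
print(sgc_loss(teacher + checker, teacher))  # close to 0.25
```

In this toy setting a constant offset behaves the opposite way: it is penalized by the low-pass term and ignored by the high-pass term, which mirrors the split between layout and scale on one side and boundaries and fine detail on the other.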

Paper Structure (19 sections, 12 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Comparison of DAv2 and a diffusion-based method. (a) Input. (b) DAv2 [yang2024depthv2] yields accurate global layout and scale but smoother details. (d) The diffusion-based method (i.e., Lotus [he2025lotus]) preserves fine details and sharper boundaries. This complementarity motivates our Priors-to-Geometry Deterministic framework (see the PGD section); the spectral disparity further motivates Spectral-Gated Distillation (see the SGD section), which transfers reliable low-frequency real-image priors while deferring high-frequency details.
  • Figure 2: Comparison of direct stage-1 and stage-2 outputs. (a) Input. (b) Unexpectedly, stage-1, operating at a high timestep with low-pass prior alignment, produces crisp boundaries and richer textures. (d) The low-timestep stage-2, refined with synthetic ground truth, yields smoother boundaries and more stable geometry. (c) The cumulative spectrum shows that stage-1 carries stronger high-frequency energy. These observations motivate using stage-1 as a high-frequency teacher via Spectral-Gated Consistency (see the SGC section).
  • Figure 3: Iris overview. Iris introduces a two-stage diffusion-based Priors-to-Geometry Deterministic framework that injects real-world priors into the diffusion model. The first (prior) stage injects real-world priors from a frozen teacher at a high-timestep state, while the second (geometry) stage refines metrically faithful predictions under synthetic supervision at a low-timestep state. In the prior stage, Spectral-Gated Distillation (see the SGD section) uses a lightweight low-pass gate to filter noisy teacher predictions into stable low-frequency layout priors, whereas in the geometry stage, Spectral-Gated Consistency (see the SGC section) applies a lightweight high-pass gate to transfer sharp boundaries and fine details from stage-1 to stage-2. The two U-Net blocks share weights. Please refer to the method section for more details.
  • Figure 4: Visualization of Spectral-Gated Distillation. SGD aligns teacher and student in the low-frequency band, injecting real-world priors for layout and scale, suppressing high-frequency artifacts, and leaving high-frequency components unconstrained for next-stage refinement. See the SGD section for more details.
  • Figure 5: Visualization of Spectral-Gated Consistency. Stage-1 naturally yields crisp detail and boundary cues. To leverage these internal cues, SGC encourages agreement between stages in the high-frequency band. See the SGC section for more details.
  • ...and 6 more figures
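
The Figure 2 caption compares outputs through a cumulative spectrum. As a rough illustration of that kind of diagnostic, the sketch below computes the fraction of spectral energy at or below each radial frequency for a 2D map. The binning scheme, the mean removal, and the name `cumulative_spectrum` are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def cumulative_spectrum(depth, num_bins=32):
    """Return (bin upper edges, cumulative energy fraction) over radial frequency."""
    h, w = depth.shape
    # Remove the mean so the DC term does not dominate the energy.
    power = np.abs(np.fft.fft2(depth - depth.mean())) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max(), num_bins + 1)
    # Assign every frequency sample to a radial bin, clamping the outermost one.
    idx = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, num_bins - 1)
    energy = np.bincount(idx, weights=power.ravel(), minlength=num_bins)
    total = energy.sum()
    cum = np.cumsum(energy) / total if total > 0 else np.zeros(num_bins)
    return edges[1:], cum

# Toy comparison: a smooth sinusoid versus the same map with a checkerboard added.
i, j = np.indices((32, 32))
smooth = np.sin(2.0 * np.pi * i / 32.0)
sharp = smooth + (-1.0) ** (i + j)
_, smooth_cum = cumulative_spectrum(smooth)
_, sharp_cum = cumulative_spectrum(sharp)
# The curve of the sharper map rises more slowly, because much of its energy
# sits in the highest-frequency bins.
print(smooth_cum[15], sharp_cum[15])
```

A curve that saturates late, like `sharp_cum` here, indicates more high-frequency energy, which is the property the caption attributes to the stage-1 output.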