Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

Yiqing Shi, Yiren Song, Mike Zheng Shou

TL;DR

Edit2Perceive reframes dense perception as a deterministic image-editing diffusion task, arguing that image-to-image priors provide stronger geometric reasoning than traditional text-to-image priors. It builds on FLUX.1 Kontext to map inputs (image, text) to dense outputs (depth, normals, matting) via a fixed-seed, single-step diffusion path and a pixel-space consistency loss, achieving state-of-the-art zero-shot performance with limited data. The key contributions include the latent-space flow matching objective, a per-task pixel-space loss, a theoretically grounded square-root depth mapping, and efficient single-step inference. Together, these advances suggest diffusion-based editors can serve as practical, geometry-aware perception foundations with broad applicability and improved efficiency.

Abstract

Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting estimation. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields up to faster runtime while training on relatively small datasets. Extensive experiments demonstrate comprehensive state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.

Paper Structure

This paper contains 43 sections, 24 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overview of the Edit2Perceive Framework. We adapt the FLUX.1 Kontext editor for dense perception. An input image $x$ and a text prompt $p$ condition the prediction of the target image $y$. In the forward process, noise is added to the target tokens to obtain $z_t$, which is then concatenated with the text condition tokens and the image condition tokens. The DiT backbone is trained to predict the velocity of the flow from a noise vector $z_0$ to the target latent $z_1$. Our training objective combines the latent-space flow matching loss ($\mathcal{L}_{\text{FM}}$) with a pixel-space consistency loss ($\mathcal{L}_{\text{Cons}}$) for enhanced geometric fidelity.
  • Figure 2: Qualitative comparison of our method with other SOTA methods across different benchmarks. The arrows highlight the regions where Edit2Perceive (ours) significantly outperforms the others. Zoom in for a better view.
  • Figure 3: Effectiveness of the Pixel-Space Consistency Loss ($\mathcal{L}_{\text{Cons}}$) across All Tasks. The radar charts compare the performance with (solid line) and without (dashed line) our consistency loss. For each task, axes represent key metrics on different datasets (lower is better for error metrics like AbsRel, Mean, MAD, SAD; higher is better for accuracy metrics like $\delta_1$, 11.25$^{\circ}$).
  • Figure 4: Ablation study on inference steps.
  • Figure 5: Additional Qualitative Comparisons for Zero-Shot Monocular Depth Estimation. Our method consistently produces more detailed and structurally coherent depth maps compared to other state-of-the-art methods across a variety of challenging indoor and outdoor scenes.
  • ...and 6 more figures
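
The Figure 1 caption describes a velocity-prediction objective along a flow from noise $z_0$ to the target latent $z_1$, paired with single-step deterministic inference. A minimal sketch of that idea follows. It assumes the common linear (rectified-flow) path $z_t = (1 - t)\,z_0 + t\,z_1$, which this page does not state explicitly. All function names are hypothetical, latents are plain Python lists rather than real model tensors, and the pixel-space consistency loss is not covered.

```python
import random

def interpolate(z0, z1, t):
    """Point on the assumed linear path: z_t = (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Velocity of the linear path, d z_t / dt = z1 - z0 (constant in t)."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(pred_v, z0, z1):
    """Mean squared error between a predicted velocity and the target velocity."""
    v = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, v)) / len(v)

def single_step_sample(z0, velocity_fn):
    """One Euler step over the whole interval [0, 1]: z1_hat = z0 + v(z0, 0)."""
    v = velocity_fn(z0, 0.0)
    return [a + b for a, b in zip(z0, v)]

# Fixed seed, so the starting noise (and hence the output) is deterministic.
rng = random.Random(0)
z1 = [0.5, -1.0, 2.0]                    # stand-in for a target latent
z0 = [rng.gauss(0.0, 1.0) for _ in z1]   # stand-in for the noise vector

# Hypothetical "oracle" model that returns the exact target velocity.
def oracle(z, t):
    return target_velocity(z0, z1)

z1_hat = single_step_sample(z0, oracle)
```

With an exact velocity prediction, a single Euler step recovers the target latent, which is why a well-trained velocity field on a near-straight path can support one-step inference. In practice the trained network replaces `oracle`, and its prediction is only approximate.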