Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
Yiqing Shi, Yiren Song, Mike Zheng Shou
TL;DR
Edit2Perceive reframes dense perception as a deterministic image-editing diffusion task, arguing that image-to-image priors provide stronger geometric reasoning than traditional text-to-image priors. It builds on FLUX.1 Kontext to map inputs (image, text) to dense outputs (depth, normals, matting) via a fixed-seed, single-step diffusion path and a pixel-space consistency loss, achieving state-of-the-art zero-shot performance with limited data. The key contributions include the latent-space flow matching objective, a per-task pixel-space loss, a theoretically grounded square-root depth mapping, and efficient single-step inference. Together, these advances suggest diffusion-based editors can serve as practical, geometry-aware perception foundations with broad applicability and improved efficiency.
Abstract
Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm and show that image editing diffusion models are inherently image-to-image consistent, providing a more suitable foundation for dense perception tasks. We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth estimation, surface normal estimation, and image matting. Built upon the FLUX.1 Kontext architecture, our approach employs full-parameter fine-tuning and a pixel-space consistency loss to enforce structure-preserving refinement across intermediate denoising states. Moreover, our single-step deterministic inference yields faster runtime, and the model is trained on relatively small datasets. Extensive experiments demonstrate state-of-the-art results across all three tasks, revealing the strong potential of editing-oriented diffusion transformers for geometry-aware perception.
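To make the components named above more concrete, the following is a minimal NumPy sketch of a square-root depth mapping, a flow-matching training pair, and a single-step deterministic update. It is an illustration under stated assumptions, not the authors' implementation: the function names, the min-max normalization to [-1, 1], the rectified-flow convention (t = 0 is the target, t = 1 is the starting state), and the oracle velocity used in the demo are all hypothetical choices, since this section does not specify them.

```python
import numpy as np


def encode_depth_sqrt(depth, d_min, d_max):
    """Map depth to [-1, 1] through a square root (assumed normalization)."""
    d = np.clip(depth, d_min, d_max)
    lo, hi = np.sqrt(d_min), np.sqrt(d_max)
    u = (np.sqrt(d) - lo) / (hi - lo)  # in [0, 1]
    return 2.0 * u - 1.0


def decode_depth_sqrt(code, d_min, d_max):
    """Invert encode_depth_sqrt for codes in [-1, 1]."""
    lo, hi = np.sqrt(d_min), np.sqrt(d_max)
    u = (np.clip(code, -1.0, 1.0) + 1.0) / 2.0
    return (u * (hi - lo) + lo) ** 2


def flow_matching_pair(x0, x1, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1 and its velocity target.

    Assumed convention: x0 is the target latent, x1 is the starting state.
    """
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return xt, target_velocity


def single_step_euler(x_start, velocity_fn):
    """One Euler step from t = 1 to t = 0: x0 is approximated by x1 - v(x1, 1)."""
    return x_start - velocity_fn(x_start, 1.0)


# Demo with a fixed seed, so the starting state is deterministic.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 4))  # stand-in for a target latent
start = rng.standard_normal((4, 4))   # stand-in for the fixed starting state


def oracle_velocity(x, t):
    # Hypothetical perfect predictor; a trained network would take this role.
    return start - target


recovered = single_step_euler(start, oracle_velocity)
```

With the oracle velocity, the single Euler step recovers `target` up to floating-point error, which is the property that makes one-step inference possible when the learned velocity field is close to a straight path. The square root allocates more of the code range to near depths than a linear mapping would; whether this matches the paper's exact formulation is not stated in this section.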
