Generative Edge Detection with Stable Diffusion
Caixia Zhou, Yaping Huang, Mochu Xiang, Jiahui Ren, Haibin Ling, Jing Zhang
TL;DR
This work introduces the Generative Edge Detector (GED), a method that repurposes a pre-trained Stable Diffusion model to perform edge detection in the latent space, predicting latent edge maps directly rather than through multi-step denoising. By encoding input images and ground-truth edges into latent representations, conditioning on a text caption, a time step, and an edge-granularity prompt, and finetuning only the final stages of the denoising U-Net, GED achieves state-of-the-art edge detection results across multiple datasets at significantly reduced training and inference cost. The approach also supports outputs at multiple granularities, with an explicit granularity regularization that maintains the ordinal relationships among predictions, enabling diverse and controllable edge maps without heavy post-processing. Overall, GED demonstrates that the rich priors of large diffusion models can substantially improve dense prediction tasks with minimal supervised data and fast inference, with practical implications for real-time perception systems.
Abstract
Edge detection is typically viewed as a pixel-level classification problem and is mainly addressed by discriminative methods. Recently, generative methods, especially diffusion-model-based solutions, have been introduced to the edge detection task. Despite their great potential, the need to retrain task-specific modules and to run multi-step denoising at inference limits their broader application. Upon closer investigation, we speculate that part of the reason is the under-exploration of the rich discriminative information encoded in extensively pre-trained large models (e.g., Stable Diffusion models). Thus motivated, we propose a novel approach, named Generative Edge Detector (GED), that fully exploits the potential of the pre-trained Stable Diffusion model. Our model can be trained and run for inference efficiently without task-specific network design, owing to the rich high-level and low-level prior knowledge provided by the pre-trained Stable Diffusion model. Specifically, we finetune the denoising U-Net to predict latent edge maps directly, taking the latent image feature maps as input. Additionally, because edges are subjective and ambiguous, we incorporate edge granularity into the denoising U-Net as one of its conditions to achieve controllable and diverse predictions. Furthermore, we devise a granularity regularization to preserve the relative granularity relationship among the multiple predictions. We conduct extensive experiments on multiple datasets and achieve competitive performance (e.g., 0.870 ODS and 0.880 OIS on the BSDS test dataset).
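For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how ODS (optimal dataset scale) and OIS (optimal image scale) F-measures are commonly derived from per-threshold counts. It is not the evaluation code used in this work. The input layout and the function names `f_measure`, `ods`, and `ois` are assumptions made for illustration, and the sketch uses a single true-positive count per threshold, omitting the distance-tolerant pixel matching and edge thinning that the standard BSDS benchmark performs.

```python
from typing import List, Tuple

# Hypothetical input layout: for each image, one tuple per threshold holding
# (true_positives, predicted_edge_pixels, ground_truth_edge_pixels).
Counts = Tuple[int, int, int]


def f_measure(tp: float, n_pred: float, n_gt: float) -> float:
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_gt)."""
    if tp == 0 or n_pred == 0 or n_gt == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gt
    return 2 * precision * recall / (precision + recall)


def ods(counts: List[List[Counts]]) -> float:
    """Best F-measure when one threshold is shared by the whole dataset."""
    n_thresholds = len(counts[0])
    best = 0.0
    for t in range(n_thresholds):
        # Pool the counts of every image at threshold index t.
        tp = sum(image[t][0] for image in counts)
        n_pred = sum(image[t][1] for image in counts)
        n_gt = sum(image[t][2] for image in counts)
        best = max(best, f_measure(tp, n_pred, n_gt))
    return best


def ois(counts: List[List[Counts]]) -> float:
    """F-measure when each image contributes counts at its own best threshold."""
    tp = n_pred = n_gt = 0
    for image in counts:
        best_counts = max(image, key=lambda c: f_measure(*c))
        tp += best_counts[0]
        n_pred += best_counts[1]
        n_gt += best_counts[2]
    return f_measure(tp, n_pred, n_gt)
```

Because OIS may pick a different threshold for each image, it is never lower than ODS on the same counts, which is consistent with the 0.870 ODS and 0.880 OIS figures quoted above.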
