Generative Edge Detection with Stable Diffusion

Caixia Zhou, Yaping Huang, Mochu Xiang, Jiahui Ren, Haibin Ling, Jing Zhang

TL;DR

This work introduces Generative Edge Detector (GED), a method that repurposes a pre-trained Stable Diffusion model to perform edge detection in the latent space, predicting latent edge maps directly rather than performing multi-step denoising. By encoding input images and ground-truth edges into latent representations, conditioning on a text caption, time step, and a granular edge prompt, and finetuning only the final stages of the denoising U-Net, GED achieves state-of-the-art edge detection results across multiple datasets with significantly reduced training and inference cost. The approach also supports multiple granularity outputs with an explicit granularity regularization to maintain ordinal relationships among predictions, enabling diverse and controllable edge maps without heavy post-processing. Overall, GED demonstrates that leveraging rich priors from large diffusion models can substantially improve dense prediction tasks with minimal supervised data and fast inference, with practical implications for real-time perception systems.

Abstract

Edge detection is typically viewed as a pixel-level classification problem, mainly addressed by discriminative methods. Recently, generative methods, especially diffusion-model-based solutions, have been introduced to the edge detection task. Despite their great potential, the need to retrain task-specific modules and the multi-step denoising inference limit their broader application. Upon closer investigation, we speculate that part of the reason is the under-exploration of the rich discriminative information encoded in extensively pre-trained large models (e.g., Stable Diffusion models). Thus motivated, we propose a novel approach, named Generative Edge Detector (GED), that fully utilizes the potential of the pre-trained Stable Diffusion model. Our model can be trained and run efficiently without a task-specific network design, thanks to the rich high-level and low-level prior knowledge provided by the pre-trained Stable Diffusion model. Specifically, we propose to finetune the denoising U-Net and predict latent edge maps directly, taking the latent image feature maps as input. Additionally, because edges are subjective and ambiguous, we incorporate edge granularity into the denoising U-Net as one of its conditions to achieve controllable and diverse predictions. Furthermore, we devise a granularity regularization to preserve the relative granularity relationship among multiple predictions. We conduct extensive experiments on multiple datasets and achieve competitive performance (e.g., 0.870 and 0.880 in terms of ODS and OIS on the BSDS test dataset).
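
The ODS and OIS scores quoted in the abstract differ only in how the binarization threshold is chosen: ODS (optimal dataset scale) uses one threshold for the whole dataset, while OIS (optimal image scale) picks the best threshold per image. The sketch below illustrates that difference under simplifying assumptions that are not from the paper: the function names are hypothetical, each image already has true-positive, false-positive, and false-negative edge-pixel counts per threshold, and the matching tolerance and threshold interpolation used by the standard BSDS benchmark code are omitted.

```python
import numpy as np

def f_score(tp, fp, fn):
    """F-measure from counts: 2*TP / (2*TP + FP + FN), or 0 when undefined."""
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

def ods_ois(tp, fp, fn):
    """Compute (ODS, OIS) from count arrays of shape (num_images, num_thresholds).

    Assumption: counts are precomputed per image and per threshold.
    """
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))

    # ODS: pool counts over all images, then pick the single best threshold.
    ods = f_score(tp.sum(axis=0), fp.sum(axis=0), fn.sum(axis=0)).max()

    # OIS: pick the best threshold separately for each image,
    # then pool the counts taken at those per-image thresholds.
    best = f_score(tp, fp, fn).argmax(axis=1)
    rows = np.arange(tp.shape[0])
    ois = f_score(tp[rows, best].sum(), fp[rows, best].sum(), fn[rows, best].sum())
    return float(ods), float(ois)

# Toy example: 2 images, 2 thresholds (made-up counts, for illustration only).
tp = [[8, 5], [4, 9]]
fp = [[2, 0], [6, 1]]
fn = [[2, 5], [6, 1]]
ods, ois = ods_ois(tp, fp, fn)
print(f"ODS={ods:.3f} OIS={ois:.3f}")
```

Because OIS may choose a different threshold for each image, it is at least as high as ODS on the same predictions, which matches the ordering of the reported scores.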

Paper Structure

This paper contains 17 sections, 4 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Qualitative results of our proposed GED with different edge granularities on the BSDS [arbelaez2010contour] test dataset.
  • Figure 2: Different ways of using the diffusion model. (a) is a conditional DDPM model, with images as the condition, trained from scratch; (b) finetunes a conditional latent DDPM model initialized with pre-trained parameters from the Stable Diffusion model; and (c) aligns edge predictions with the ground truth in the latent space using a pre-trained DDPM. On the BSDS [arbelaez2010contour] test dataset, (a) achieves 0.834 ODS and 0.848 OIS with 5 steps, and 0.833 ODS and 0.846 OIS with 50 steps. (b) achieves 0.841 ODS and 0.856 OIS with 5 steps, and 0.832 ODS and 0.848 OIS with 50 steps. (c) achieves 0.870 ODS and 0.880 OIS with 1 step, a large-margin improvement with less inference time.
  • Figure 3: The overall framework of our proposed GED. Given an input image $\mathbf{x}$, its corresponding label set $\mathbf{y}$, and a text prompt $p$, we first obtain the granularity $g$ of each label by normalization. We then extract text features $\mathbf{f}_l$ with the text encoder $\mathcal{E}_l$, and latent image feature maps $\mathbf{z}_i$ and latent edge maps $\mathbf{z}_e$ with the VAE encoder $\mathcal{E}$. We feed the granularity $g$, the latent image maps $\mathbf{z}_i$, the corresponding time $t=1$, and the text features $\mathbf{f}_l$ into the denoising U-Net $\mathcal{U}_\theta$ to obtain the predicted latent edge maps $\hat{\mathbf{z}}_e$, which are decoded by $\mathcal{D}$ into the final edge prediction $\hat{\mathbf{y}}$. The granularity $g$ is encoded by two fully connected layers into $\mathbf{f}_g$, which has the same dimension as the time embeddings $\mathbf{f}_t$, and is then added element-wise to $\mathbf{f}_t$. We also add explicit regularizations to ensure relative ordinal granularity relationships.
  • Figure 4: Qualitative comparisons on challenging samples in the BSDS500 test set. Note that MuGE and our proposed GED produce diverse results with edge granularity of 0, 0.5, and 1, respectively.
  • Figure 5: Qualitative comparisons on challenging samples in the NYUD and BIPED test sets.
  • ...and 6 more figures
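
The granularity conditioning described in the Figure 3 caption (two fully connected layers that map $g$ to $\mathbf{f}_g$, which is then added to the time embedding $\mathbf{f}_t$) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the class and function names, the hidden width, the SiLU activation between the two layers, and the random weight initialization are all assumptions made for the example.

```python
import numpy as np

class GranularityEncoder:
    """Hypothetical two-layer MLP mapping a scalar granularity g to a vector f_g."""

    def __init__(self, embed_dim, hidden_dim=None, seed=0):
        rng = np.random.default_rng(seed)
        hidden_dim = hidden_dim or embed_dim  # assumed hidden width
        self.w1 = rng.normal(0.0, 0.02, size=(1, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, embed_dim))
        self.b2 = np.zeros(embed_dim)

    def __call__(self, g):
        g = np.asarray(g, dtype=float).reshape(-1, 1)  # (batch, 1)
        h = g @ self.w1 + self.b1                      # first fully connected layer
        h = h / (1.0 + np.exp(-h))                     # SiLU activation (assumption)
        return h @ self.w2 + self.b2                   # second layer -> (batch, embed_dim)

def condition_embedding(f_t, g, encoder):
    """Add the granularity embedding f_g element-wise to the time embedding f_t."""
    f_g = encoder(g)
    return f_t + f_g

# Usage: three granularity levels conditioning the same (placeholder) time embedding.
embed_dim = 8
f_t = np.zeros((3, embed_dim))        # stand-in for the real time embeddings
encoder = GranularityEncoder(embed_dim)
cond = condition_embedding(f_t, [0.0, 0.5, 1.0], encoder)
print(cond.shape)
```

Adding $\mathbf{f}_g$ to $\mathbf{f}_t$ means the granularity signal reaches every U-Net block that already consumes the time embedding, so no new conditioning pathway is required.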