Boosting Edge Detection with Pixel-wise Feature Selection: The Extractor-Selector Paradigm

Hao Shu

TL;DR

Edge detection models often fuse multi-scale features uniformly, which fails to distinguish edge from texture regions. The authors introduce the Extractor-Selector (E-S) paradigm, deploying a pixel-wise selector in tandem with a feature extractor to enable adaptive fusion, with an enhanced EES variant that leverages richer, less-degraded intermediate features. Across BRIND, BIPED2, UDED, BSDS500, and NYUD2, E-S and especially EES yield substantial gains in ODS, OIS, and AP without post-processing, validating the approach's effectiveness and robustness. The framework preserves compatibility with existing ED architectures and shows potential for broader applications such as contour detection and segmentation, offering a practical path to more precise and perceptually satisfying edge predictions.

Abstract

Deep learning has significantly advanced image edge detection (ED), primarily through improved feature extraction. However, most existing ED models apply uniform feature fusion across all pixels, ignoring critical differences between regions such as edges and textures. To address this limitation, we propose the Extractor-Selector (E-S) paradigm, a novel framework that introduces pixel-wise feature selection for more adaptive and precise fusion. Unlike conventional image-level fusion, which applies the same convolutional kernel to all pixels, our approach dynamically selects relevant features at each pixel, enabling more refined edge predictions. The E-S framework can be seamlessly integrated with existing ED models without architectural changes, delivering substantial performance gains. It can also be combined with enhanced feature extractors for further accuracy improvements. Extensive experiments across multiple benchmarks confirm that our method consistently outperforms baseline ED models. For instance, on the BIPED2 dataset, the proposed framework improves ODS and OIS by over 7$\%$ and AP by 22$\%$, demonstrating its effectiveness.
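The contrast between image-level and pixel-wise fusion described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy feature maps, the selector logits, and the softmax normalization of the per-pixel weights are all assumptions made for illustration.

```python
import math

def uniform_fusion(features, channel_weights):
    """Image-level fusion: one weight per channel, shared by every pixel
    (equivalent to a 1x1 convolution over the channel axis)."""
    height, width = len(features[0]), len(features[0][0])
    return [[sum(w * fmap[y][x] for w, fmap in zip(channel_weights, features))
             for x in range(width)] for y in range(height)]

def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pixelwise_fusion(features, selector_logits):
    """Pixel-wise fusion: every pixel gets its own weights over the feature maps.
    Softmax normalization is an assumption, not taken from the paper."""
    height, width = len(features[0]), len(features[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            weights = softmax([logit_map[y][x] for logit_map in selector_logits])
            row.append(sum(w * fmap[y][x] for w, fmap in zip(weights, features)))
        fused.append(row)
    return fused

# Two 1x2 feature maps: channel 0 fires on the left pixel, channel 1 on the right.
features = [[[1.0, 0.0]], [[0.0, 1.0]]]
# Hypothetical selector logits that prefer channel 0 on the left, channel 1 on the right.
logits = [[[10.0, -10.0]], [[-10.0, 10.0]]]

uniform_map = uniform_fusion(features, [0.5, 0.5])  # same weights everywhere
pixel_map = pixelwise_fusion(features, logits)      # weights vary per pixel
```

With shared weights, both pixels receive the same compromise response, whereas the per-pixel weights let each location pick the channel that actually responds there. This is the behavior the selector is meant to learn.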

Paper Structure

This paper contains 32 sections, 5 equations, 6 figures, and 16 tables.

Figures (6)

  • Figure 1: The framework of the E-S paradigm. It consists of two components: a feature extractor, which produces features and coarse edge maps, and a feature selector, which selects features for refinement. Together, the two components output the final edge predictions.
  • Figure 2: An overview of the feature selector architecture, which is structured like a U-Net. The input image passes through a feature extraction block, followed by four $\frac{1}{2}$ down-sampling blocks that capture multi-scale features. At the $\frac{1}{8}$ and $\frac{1}{16}$ scales, cascaded transformer encoder blocks are applied. The features from the $\frac{1}{16}$ scale are progressively up-sampled, and each up-sampling block incorporates a residual connection with a learnable coefficient that balances the up-sampled features against the features from the residual connection. Finally, pixel-wise weights are produced by fusing the original-scale features. Detailed block designs are presented in the supplementary materials.
  • Figure 3: The standard extractor architecture. The backbone network extracts multi-scale features through progressive down-sampling. At each scale, these features are fused into one or two edge maps, which are then up-sampled to the original image resolution. The final edge prediction is obtained by fusing these up-sampled maps. However, because the features are compressed before up-sampling, the resulting high-resolution representations often suffer substantial information loss, which limits their usefulness to the selector.
  • Figure 4: The modified feature extractor architecture. Unlike the standard extractor architecture, which compresses multi-scale features before up-sampling, the modified version preserves richer information by avoiding early fusion. It uses the same backbone to extract multi-scale features, which are first unified to a suitable number of channels and then up-sampled without compression. Two additional fixed feature maps (one filled with zeros and one with ones) are appended to assist the selector in identifying highly confident edge and non-edge regions. Although a coarse fusion is applied during pre-training of the extractor, it is these enhanced, unfused features that are passed to the selector for final prediction, providing more refined and less lossy representations compared to the standard approach.
  • Figure 5: Visual comparisons of the models. Column 1 presents the original image and its cropped regions, marked by a blue box (the text in the upper left) and a yellow box (the tire in the lower right). Column 2 displays the corresponding ground truths. Columns 3, 5, and 7 display the predictions of the baseline models, while columns 4, 6, and 8 show the predictions of the corresponding EES frameworks. The EES frameworks produce perceptually better results.
  • ...and 1 more figures
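
Two mechanisms mentioned in the captions above are sketched below: the constant zero and one maps appended to the extractor features (Figure 4), and the up-sampling step that balances up-sampled features against a residual connection through a learnable coefficient (Figure 2). This is only an illustration under stated assumptions. The function names are hypothetical, nearest-neighbor up-sampling is a stand-in for whatever the paper uses, and the convex blend `alpha * up + (1 - alpha) * skip` is one plausible reading of "balance"; the exact form is given in the paper's supplementary materials, not here.

```python
def append_constant_maps(features):
    """Append an all-zeros and an all-ones map to a list of HxW feature maps.
    A selector that puts all its weight on one of them outputs exactly 0 or 1."""
    height, width = len(features[0]), len(features[0][0])
    zeros = [[0.0] * width for _ in range(height)]
    ones = [[1.0] * width for _ in range(height)]
    return features + [zeros, ones]  # new list; the input is not mutated

def upsample_nearest_2x(fmap):
    """Nearest-neighbor 2x up-sampling of one HxW map (assumed stand-in)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def blend_with_skip(low_res, skip, alpha):
    """Blend up-sampled features with skip features. In a real model alpha
    would be a trained parameter; the convex combination is an assumption."""
    up = upsample_nearest_2x(low_res)
    return [[alpha * u + (1.0 - alpha) * s for u, s in zip(up_row, skip_row)]
            for up_row, skip_row in zip(up, skip)]

features = [[[0.2, 0.7], [0.4, 0.9]]]      # one 2x2 feature map
extended = append_constant_maps(features)  # three maps: feature, zeros, ones

low_res = [[1.0]]                          # 1x1 map from the coarser scale
skip = [[0.0, 2.0], [4.0, 6.0]]            # 2x2 map from the residual connection
blended = blend_with_skip(low_res, skip, alpha=0.25)
```

The constant maps give a pixel-wise selector a direct way to express a confident non-edge or edge decision regardless of what the learned features say at that pixel.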