Uncertainty Guided Refinement for Fine-Grained Salient Object Detection

Yao Yuan, Pan Gao, Qun Dai, Jie Qin, Wei Xiang

TL;DR

This work tackles the persistent issue of fine-grained saliency predictions being undermined by shadows and undersaturation near object boundaries. It introduces UGRAN, an uncertainty-guided refinement framework comprising three modules, Multilevel Interaction Attention (MIA), Scale Spatial-Consistent Attention (SSCA), and Uncertainty Refinement Attention (URA), plus an Adaptive Dynamic Partition (ADP) mechanism to balance performance and computation. The approach yields state-of-the-art results on seven benchmark datasets and runs in real time, with an emphasis on explicit uncertainty-guided refinement rather than traditional boundary priors. The methodology and findings offer a path to more reliable binary segmentation in challenging visual scenes and can be extended to related tasks beyond SOD.

Abstract

Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model's ability to make reliable fine-grained predictions. To address this, we introduce an uncertainty guidance learning approach to SOD, intended to enhance the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among them. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model's perception of uncertain regions, generating a highly saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over state-of-the-art methods. Code will be released at https://github.com/I2-Multimedia-Lab/UGRAN.
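
The abstract says the uncertainty map is generated from the saliency prediction map but does not give the formula here. As a rough illustration only, the sketch below assumes one common formulation, u = 1 - |2p - 1|, which peaks where a predicted probability p is near 0.5 and vanishes where the prediction is confidently 0 or 1. The function name `uncertainty_map` and this formula are assumptions for illustration, not the paper's definition.

```python
def uncertainty_map(saliency):
    """Map saliency probabilities in [0, 1] to uncertainty scores in [0, 1].

    Assumed formulation (not taken from the paper): u = 1 - |2p - 1|.
    u is 1 at p = 0.5 (most ambiguous) and 0 at p = 0 or p = 1 (confident).
    """
    return [[1.0 - abs(2.0 * p - 1.0) for p in row] for row in saliency]


# A toy 3x3 saliency prediction: confident background in the top-left,
# confident foreground in the bottom-right, ambiguous values in between.
pred = [
    [0.02, 0.10, 0.50],
    [0.05, 0.60, 0.95],
    [0.45, 0.90, 1.00],
]
unc = uncertainty_map(pred)

for row in unc:
    print(["%.2f" % u for u in row])
```

Under this assumed formula, low-saturation pixels (values far from both 0 and 1) receive the highest uncertainty, which matches the kind of region the abstract says the refinement should focus on.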

Paper Structure

This paper contains 27 sections, 10 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: The uncertainty map, compared to the boundary map, more accurately reflects the areas with artifacts and low saturation in the current saliency prediction, thus providing more targeted guidance for the model.
  • Figure 2: Illustration of the proposed Adaptive Dynamic Partition (ADP) mechanism. (a) In a blurred region, we stop further partitioning to preserve the efficacy of uncertainty guidance; (b) in a clear region, we partition further to minimize computational cost.
  • Figure 3: Architecture of the proposed model. The backbone is defined as a hierarchical network structure (e.g., ResNet, Swin Transformer). The multilevel features extracted by the backbone are denoted as F$_0$ to F$_4$, and their spatial dimensions decrease sequentially.
  • Figure 4: Details of the proposed modules. All convolutions, excluding the labeled ones, use a 1$\times$1 kernel, a stride of 1, and preserve the channel dimension.
  • Figure 5: The feature maps are progressively upsampled during the inference process until they align with the original image size. The upsample ratio is variable and can be adjusted based on specific requirements.
  • ...and 3 more figures
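
The partitioning behaviour described in the Figure 2 caption can be sketched as a recursive, quadtree-style split. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `adaptive_partition`, the mean-uncertainty criterion, the `threshold` and `min_size` parameters, and the restriction to square power-of-two windows are all hypothetical choices made for illustration.

```python
def adaptive_partition(unc, top, left, size, min_size, threshold):
    """Recursively split a square window of an uncertainty map.

    Hypothetical rule: a window whose mean uncertainty reaches `threshold`
    is treated as blurred and kept whole; a clear window is split into four
    quadrants until `min_size` is reached. `size` is assumed to be a power
    of two. Returns a list of (top, left, size) windows.
    """
    window = [unc[r][left:left + size] for r in range(top, top + size)]
    mean_unc = sum(map(sum, window)) / (size * size)

    # Blurred window, or already at the smallest allowed size: stop here.
    if mean_unc >= threshold or size <= min_size:
        return [(top, left, size)]

    # Clear window: split into four quadrants and recurse into each.
    half = size // 2
    windows = []
    for dt in (0, half):
        for dl in (0, half):
            windows += adaptive_partition(
                unc, top + dt, left + dl, half, min_size, threshold
            )
    return windows


# Toy 8x8 uncertainty map: the top-left 4x4 block is uncertain, the rest clear.
unc = [[0.9 if r < 4 and c < 4 else 0.0 for c in range(8)] for r in range(8)]
windows = adaptive_partition(unc, 0, 0, 8, min_size=2, threshold=0.5)
print(len(windows), windows[0])
```

In this toy case the uncertain block stays as a single 4x4 window while each clear quadrant is split into 2x2 windows, so attention computed within windows would cover the uncertain area at full extent and the clear areas in smaller, cheaper pieces.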