SODAWideNet++: Combining Attention and Convolutions for Salient Object Detection

Rohit Venkata Sai Dulam, Chandra Kambhamettu

TL;DR

SODAWideNet++ addresses the limited transferability of ImageNet-pretrained backbones to Salient Object Detection (SOD) by introducing an encoder-decoder network, pre-trained end-to-end, that fuses convolutional inductive biases with self-attention. The core innovation is the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large-dilation convolutions with self-attention to produce input-dependent long-range features; it is complemented by the Attention-enhanced Local Processing Module (ALPM). The model is pre-trained end-to-end on a modified COCO semantic segmentation dataset with binarized labels and then fine-tuned on standard SOD benchmarks; supervising background predictions further improves accuracy. The approach achieves competitive results on five datasets with roughly 33–35% of the trainable parameters of state-of-the-art models, demonstrating the viability of end-to-end SOD pre-training and the benefit of integrating attention into convolutional pipelines for dense prediction tasks.

Abstract

Salient Object Detection (SOD) has traditionally relied on feature refinement modules that utilize the features of an ImageNet pre-trained backbone. However, this approach limits the possibility of pre-training the entire network because of the distinct nature of SOD and image classification. Additionally, the architecture of these backbones, originally built for image classification, is sub-optimal for a dense prediction task like SOD. To address these issues, we propose a novel encoder-decoder-style neural network called SODAWideNet++ that is designed explicitly for SOD. Inspired by the vision transformer's ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large dilated convolutions and self-attention. Specifically, we use attention features to guide long-range information extracted by multiple dilated convolutions, thus taking advantage of the inductive biases of a convolution operation and the input dependency brought by self-attention. In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end. Further, we supervise the background predictions along with the foreground to push our model to generate accurate saliency predictions. SODAWideNet++ performs competitively on five different datasets while containing only 35% of the trainable parameters of the state-of-the-art models. The code and pre-computed saliency maps are provided at https://github.com/VimsLab/SODAWideNetPlusPlus.
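The two data-side ideas in the abstract, binarizing segmentation annotations and supervising the background alongside the foreground, can be illustrated with a minimal NumPy sketch. Everything named here is hypothetical: `binarize_segmentation`, `foreground_background_loss`, and the `background_id=0` convention are assumptions for illustration, and the loss is a plain unweighted binary cross-entropy rather than the weighted loss the paper defines.

```python
import numpy as np


def binarize_segmentation(label_map, background_id=0):
    """Collapse a per-pixel class-ID map into a binary mask.

    Assumption: every pixel whose ID differs from `background_id` counts as
    foreground. The ID value 0 is a placeholder, not taken from the paper.
    """
    label_map = np.asarray(label_map)
    return (label_map != background_id).astype(np.uint8)


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a 0/1 target."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    per_pixel = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(per_pixel.mean())


def foreground_background_loss(fg_pred, bg_pred, mask):
    """Supervise the foreground map with `mask` and the background map with
    its complement. Unweighted stand-in for the paper's loss."""
    mask = np.asarray(mask, dtype=np.float64)
    fg_term = binary_cross_entropy(fg_pred, mask)
    bg_term = binary_cross_entropy(bg_pred, 1.0 - mask)
    return fg_term + bg_term
```

Under these assumptions, a label map such as `[[0, 3], [17, 0]]` becomes the mask `[[0, 1], [1, 0]]`, and the combined loss is smallest when the foreground prediction matches the mask and the background prediction matches its complement.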

Paper Structure (20 sections, 12 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: The proposed architecture, SODAWideNet++, contains two branches: one extracts global features using AGLRFE, and the other extracts local features using ALPM. These global and local features pass through CFM, producing the output of an encoding layer. The decoding layers also consist of two parallel paths: MRFFAM, which decodes features through multiple receptive fields, and an Identity operation.
  • Figure 2: The Attention-guided Long-Range Feature Extraction (AGLRFE) module consists of two branches: a collection of dilated convolutions with different dilation rates, which extract long-range convolution features, and a self-attention block. We reduce the spatial resolution of the input with Average Pooling before the Self-Attention block and refine the features using a series of convolution operations. Attention features are then upsampled to the same resolution as the features from the dilated convolutions. A series of convolution layers then reduces the attention features to a single channel, which passes through a Sigmoid layer. The resulting map refines our long-range convolution features, making them input-dependent.
  • Figure 3: Illustration of the pixels that receive a higher weight for loss computation, as defined in Equation \ref{fgbgloss}. The third image shows in white the pixels that receive a higher weight for the background loss, and the fourth image shows the pixels that receive a higher weight for the foreground loss. The last two images depict the important pixels (in blue) superimposed (SI) on the input image.
  • Figure 4: Visual comparison of our results against other models.
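
The data flow described in the Figure 2 caption can be sketched as a small PyTorch module. This is an illustrative reading of the caption, not the authors' implementation: the class name `AGLRFESketch`, the dilation rates, the pooling factor, the number of heads, the layer counts, and the use of element-wise multiplication to apply the sigmoid map are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AGLRFESketch(nn.Module):
    """Hypothetical attention-guided long-range block (assumed hyperparameters)."""

    def __init__(self, channels=32, dilations=(2, 4, 6), pool=4, heads=4):
        super().__init__()
        # Branch 1: parallel dilated 3x3 convolutions; padding == dilation keeps H x W.
        self.dilated = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations]
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        # Branch 2: average pooling, a refining convolution, then self-attention.
        self.pool = nn.AvgPool2d(pool)
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Convolutions that bring the attention features down to one channel.
        self.to_gate = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Long-range convolution features at full resolution.
        long_range = self.fuse(torch.cat([conv(x) for conv in self.dilated], dim=1))
        # Self-attention over the pooled (low-resolution) feature map.
        low = self.refine(self.pool(x))
        hl, wl = low.shape[-2:]
        tokens = low.flatten(2).transpose(1, 2)  # (B, hl * wl, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        attended = attended.transpose(1, 2).reshape(b, c, hl, wl)
        # Upsample to the resolution of the dilated-convolution features.
        attended = F.interpolate(attended, size=(h, w), mode="bilinear", align_corners=False)
        # Single-channel sigmoid map; multiplication is an assumed form of "refine".
        gate = torch.sigmoid(self.to_gate(attended))  # (B, 1, H, W)
        return long_range * gate
```

Pooling before attention keeps the token count small (a 32×32 map pooled by 4 yields 64 tokens), which is one plausible reason for the resolution reduction the caption mentions; the output has the same shape as the input, so the block can stand in wherever a convolutional layer would.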