Unifying Global-Local Representations in Salient Object Detection with Transformer

Sucheng Ren, Qiang Wen, Nanxuan Zhao, Guoqiang Han, Shengfeng He

TL;DR

This paper introduces the vision transformer, an attention-based encoder, into salient object detection so that representations are global from shallow to deep layers, which preserves the local details needed to recover spatial detail in the final saliency maps.

Abstract

The fully convolutional network (FCN) has long dominated salient object detection. However, because convolution is local, a CNN must be deep enough to attain a global receptive field, and such depth always comes at the cost of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection so that representations are global from shallow to deep layers. With a global view even in very shallow layers, the transformer encoder preserves more local representations, which helps recover spatial details in the final saliency maps. In addition, because each layer can capture a global view of its previous layer, adjacent layers can implicitly maximize representation differences and minimize redundant features, so that every output feature of the transformer layers contributes uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method outperforms other FCN-based and transformer-based methods on five benchmarks by a large margin, with an average improvement of 12.17% in Mean Absolute Error (MAE). Code will be available at https://github.com/OliverRensu/GLSTR.
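The abstract reports results in terms of Mean Absolute Error (MAE). As a minimal sketch of how that metric is commonly computed for saliency maps, assuming both the prediction and the ground truth are 2D grids with values in [0, 1], the per-pixel absolute differences are averaged (lower is better). The function name `mean_absolute_error` and the toy values below are illustrative assumptions, not taken from the paper or its code.

```python
def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between two equally sized 2D maps."""
    if not pred or len(pred) != len(gt):
        raise ValueError("prediction and ground truth must be non-empty and equally sized")
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        if len(pred_row) != len(gt_row):
            raise ValueError("rows must have the same length")
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    if count == 0:
        raise ValueError("maps must contain at least one pixel")
    return total / count


# Toy 2x2 example (made-up values, for illustration only).
toy_pred = [[0.9, 0.2], [0.1, 0.8]]  # hypothetical predicted saliency
toy_gt = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical binary ground truth
toy_mae = mean_absolute_error(toy_pred, toy_gt)
```

Here the four absolute errors are 0.1, 0.2, 0.1, and 0.2, so `toy_mae` is approximately 0.15.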

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of our method. We propose the Global-Local Saliency TRansformer (GLSTR) to unify global and local features in each layer. We compare our method with a SOTA method based on the FCN architecture, GateNet (zhao2020suppress). Our method localizes salient regions precisely, with accurate boundaries.
  • Figure 2: The visual attention maps of the red block in the input image on the first and twelfth layers of the transformer. The attention maps show that features from shallow layers can also carry global information and features from deep layers can also carry local information.
  • Figure 3: The pipeline of our proposed method. We first divide the image into non-overlapping patches and map each patch into a token before feeding the tokens to the transformer layers. After encoding features through 12 transformer layers, we decode each output feature in three successive stages with 8$\times$, 4$\times$, and 2$\times$ upsampling respectively. Each decoding stage contains four layers, and the input of each layer comes from the features of its previous layer together with the corresponding transformer layer.
  • Figure 4: Different types of decoders. (a) Naive Decoder directly upsamples the output 16$\times$. (b) Stage-by-Stage Decoder upsamples the resolution 2$\times$ in each stage. (c) Multi-level feature Aggregation Decoder sparsely fuses multi-level features. Our decoder (d) densely decodes all transformer features and gradually upsamples to the resolution of inputs.
  • Figure 5: Qualitative comparisons with state-of-the-art methods. Compared with other transformer-based (denoted as T) and FCN-based (denoted as C) methods, our method provides more visually reasonable saliency maps by accurately locating salient objects and generating sharp boundaries.
  • ...and 1 more figures
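
The first step in the Figure 3 caption, dividing the image into non-overlapping patches, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `split_into_patches` is hypothetical, the image is a single-channel 2D grid, and the patch size is a free parameter (the 16$\times$ upsampling of the naive decoder in Figure 4 suggests a patch size of 16, but that is an inference). The learned mapping from each flattened patch to a token is omitted.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid into non-overlapping patch_size x patch_size patches.

    Returns the patches in row-major order, each flattened to a 1D list.
    In a vision transformer, each flattened patch would then be mapped
    to a token by a learned projection (not shown here).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("image must be non-empty")
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are their row-major indices 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = split_into_patches(toy_image, 2)
```

With a 4$\times$4 input and a patch size of 2, this yields (4/2) $\times$ (4/2) = 4 patches of 4 values each; the first patch covers the top-left corner, `[0, 1, 4, 5]`.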