VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu, Ziyang Luo, Ni Zhang, Junwei Han

TL;DR

This work extends the pure-transformer Visual Saliency Transformer (VST) to VST++, improving efficiency and generalization across RGB, RGB-D, and RGB-T salient object detection (SOD). It introduces Select-Integrate Attention (SIA) to reduce the quadratic cost of self-attention, a depth position encoding that brings depth cues into RGB-D SOD at low cost, and a token-supervised prediction loss that directly supervises the task-related tokens. A token-based multi-task decoder with reverse T2T upsampling produces high-resolution saliency maps inside the transformer itself, without CNN-style upsampling. The model is reported to outperform existing methods while cutting computational cost by 25% without significant loss in performance. It is evaluated across several transformer backbones and all three modalities, which supports the case for pure-transformer architectures in dense, multi-modal saliency tasks.

Abstract

While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.
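
The Select-Integrate Attention idea described in the abstract (keep foreground tokens at fine granularity, pool the background into a single coarse token) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the names `select_integrate` and `attend` are hypothetical, mean pooling is an assumption for how the background token is formed, and the learned query/key/value projections, task tokens, and position encodings are all left out.

```python
import math

def mean_token(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def select_integrate(tokens, mask):
    """Build the reduced key/value set.

    Foreground tokens (mask == 1) are kept one by one; all background
    tokens are pooled into a single token (assumed here: mean pooling).
    """
    fg = [t for t, m in zip(tokens, mask) if m]
    bg = [t for t, m in zip(tokens, mask) if not m]
    return fg + ([mean_token(bg)] if bg else [])

def attend(queries, keys_values):
    """Scaled dot-product attention with keys == values, no projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) / total
                    for j in range(d)])
    return out

# Six patch tokens; a binarized mask marks the first two as foreground.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
mask = [1, 1, 0, 0, 0, 0]

kv = select_integrate(tokens, mask)   # 2 foreground tokens + 1 background token
out = attend(tokens, kv)              # every patch token still acts as a query

full_pairs = len(tokens) * len(tokens)  # query-key pairs in full self-attention
sia_pairs = len(tokens) * len(kv)       # query-key pairs with the reduced set
print(len(kv), full_pairs, sia_pairs)
```

In this toy case the key/value set shrinks from 6 tokens to 3, so the number of query-key pairs drops from 36 to 18. The saving in a real model depends on how much of the image the mask marks as foreground, so this ratio should not be read as the paper's reported 25% figure.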

Paper Structure (30 sections, 13 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overall architecture of our proposed VST++ model for both RGB and RGB-D SOD. The encoder generates multi-level tokens from the input image patch sequence. Next, the convertor projects the patch tokens to the decoder space and performs cross-modal information fusion for RGB-D SOD. Finally, a multi-task transformer decoder simultaneously performs saliency detection and boundary segmentation via the proposed task-related tokens and the patch-task-attention mechanism. An RT2T transformation is designed to progressively upsample the patch tokens. At the 1/8 and 1/4 decoder levels, we design Select-Integrate Attention (SIA) to select fine-grained foreground segments and aggregate coarse-grained background information for low-cost attention computation. We keep standard self-attention at the 1/16 decoder level, since no mask from a previous stage is available there. Depth Position Embedding (DPE) or sinusoidal position encoding (PE) is added to the query and key of the SIA for RGB-D SOD and RGB SOD, respectively. Additionally, we calculate the token-supervised prediction losses at every decoder level. The dotted line represents components exclusively designed for RGB-D SOD.
  • Figure 2: (a) T2T module merges adjacent tokens into a new token, effectively reducing token length. (b) Our reverse T2T module enlarges each token into multiple sub-tokens, achieving token upsampling.
  • Figure 3: Architecture of the SIA module. We utilize the upsampled and binarized mask $\bm{M}_i$ from the previous stage to select the foreground patch tokens as $\bm{T}^{\mathcal{D}}_{i\_f}$. The background regions are then integrated into a background token $t_i^g$. We use them to replace the original patch tokens $\bm{T}^{\mathcal{D}}_{i}$ to generate key and value, while $\bm{T}^{\mathcal{D}}_{i}$ is used to obtain query for performing cross attention. The two task tokens are also used in the query, key, and value. Depth Position Embedding (DPE) or sinusoidal position encoding (PE) is added to the query and key of the SIA for RGB-D SOD and RGB SOD, respectively.
  • Figure 4: Qualitative comparison of our model against state-of-the-art RGB SOD methods. (GT: ground truth.)
  • Figure 5: Qualitative comparison of our model against state-of-the-art RGB-D SOD methods. (GT: ground truth.)
  • ...and 2 more figures
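
The token merging and token upsampling contrasted in the Figure 2 caption can be sketched on a small token grid. The code below is a simplified illustration under stated assumptions, not the paper's module: `t2t` and `reverse_t2t` are hypothetical names, windows are non-overlapping, and no learned projection is applied. The actual T2T and RT2T layers may use overlapping windows and linear projections that are not modeled here.

```python
def t2t(grid, k):
    """Merge each non-overlapping k x k block of tokens into one token.

    `grid` is an h x w nested list of token vectors; h and w must be
    divisible by k. Merging is plain concatenation, so token length grows
    by a factor of k*k while the token count shrinks by the same factor.
    """
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, k):
        row = []
        for j in range(0, w, k):
            merged = []
            for di in range(k):
                for dj in range(k):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

def reverse_t2t(grid, k, c):
    """Expand each token of length c*k*k into a k x k block of c-dim sub-tokens."""
    h, w = len(grid), len(grid[0])
    out = [[None] * (w * k) for _ in range(h * k)]
    for i in range(h):
        for j in range(w):
            token = grid[i][j]
            if len(token) != c * k * k:
                raise ValueError("token length must equal c * k * k")
            for di in range(k):
                for dj in range(k):
                    start = (di * k + dj) * c
                    out[i * k + di][j * k + dj] = token[start:start + c]
    return out

# A 2 x 2 grid of tokens, each of length 4 (c = 1, k = 2).
grid = [[[1, 2, 3, 4], [5, 6, 7, 8]],
        [[9, 10, 11, 12], [13, 14, 15, 16]]]

up = reverse_t2t(grid, k=2, c=1)   # 4 x 4 grid of length-1 sub-tokens
down = t2t(up, k=2)                # merging again recovers the original grid
print(len(up), len(up[0]), down == grid)
```

Here 4 tokens become 16 after upsampling, and applying the merge step to the result returns the original grid, which shows that in this simplified form the two operations are exact inverses. With overlapping windows, as in an unfold/fold formulation, that exact inverse property would not hold in general.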