SUM: Saliency Unification through Mamba for Visual Attention Modeling

Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, Babak Taati

TL;DR

This work tackles universal visual saliency modeling across diverse image types while reducing the computational burden of Transformer-based approaches. It introduces SUM, a unified Mamba-U-Net-based predictor augmented with a Conditional Visual State Space (C-VSS) block that uses data-type tokens to adapt to natural scenes, web pages, and commercial imagery. The model combines the linear-complexity Mamba framework, a 2D-aware SS2D processing scheme, and a four-token conditioning mechanism that modulates features, achieving state-of-the-art or competitive results on six benchmarks and robust cross-domain performance. Ablation analyses validate the importance of the loss composition, C-VSS placement, and prompt-based conditioning, supporting SUM’s practical value for efficient, universal visual attention modeling.

Abstract

Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, a separate model is often required for each image type, as no unified approach exists. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content.
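
The token-based conditioning described above can be illustrated with a rough sketch. It is not the authors' C-VSS implementation: the class name `TokenConditioner`, the token dimension, the single linear layer, and the scale-and-shift form of the modulation are all assumptions made for illustration. Only the idea of one token per data type, and the four data types themselves, come from the text and figure captions.

```python
import numpy as np

# The four data types named in the figure captions; the ordering here is arbitrary.
DATA_TYPES = ["natural_scene_mouse", "natural_scene_eye", "e_commerce", "ui"]


class TokenConditioner:
    """Hypothetical helper: maps a data-type token to a per-channel scale and shift."""

    def __init__(self, channels, token_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.channels = channels
        # One token per data type (learned in a real model, random here).
        self.tokens = rng.normal(size=(len(DATA_TYPES), token_dim))
        # Assumption: a single linear layer stands in for whatever network
        # turns a token into modulation parameters.
        self.weight = rng.normal(scale=0.1, size=(token_dim, 2 * channels))
        self.bias = np.zeros(2 * channels)

    def __call__(self, features, data_type):
        """Modulate a feature map of shape (channels, height, width)."""
        token = self.tokens[DATA_TYPES.index(data_type)]
        params = token @ self.weight + self.bias
        scale, shift = params[: self.channels], params[self.channels:]
        # Broadcast the per-channel parameters over the spatial dimensions.
        return features * (1.0 + scale)[:, None, None] + shift[:, None, None]


conditioner = TokenConditioner(channels=8)
features = np.ones((8, 4, 4))
out_ui = conditioner(features, "ui")
out_eye = conditioner(features, "natural_scene_eye")
```

In this sketch the same feature map yields different outputs for different data types while the rest of the network stays shared, which is the general sense in which a single model could serve several image domains.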

Paper Structure (14 sections, 8 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: (a) Overview of our SUM model, (b) conditional-U-Net-based model for saliency prediction, and (c) C-VSS module.
  • Figure 2: Comparative visualizations of saliency predictions across different data types. The first row depicts Natural Scene-Mouse data, the second row showcases Natural Scene-Eye data, the third row features E-commerce, and the fourth row displays UI. Each row highlights the model's performance in identifying salient features within these distinct categories.
  • Figure 3: Visualizations of SUM’s predictions across different datasets. The first and second rows depict Natural Scene-Mouse data, while the third and fourth rows showcase Natural Scene-Eye data. The fifth and sixth rows present E-commerce data, and the seventh and eighth rows display UI data.
  • Figure 4: Visualizations of SUM’s predictions across different datasets. The first and second rows showcase the Toronto dataset [bruce2007attention], while the third and fourth rows present the FIWI dataset [shen2014webpage]. The fifth and sixth rows display data from the TUD Image Quality Database 1 [liu2009studying], and the seventh and eighth rows exhibit data from the TUD Image Quality Database 2 [alers2010studying].