DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection

Yu Zhang, Haoan Ping, Yuchen Li, Zhenshan Bing, Fuchun Sun, Alois Knoll

TL;DR

This work tackles salient object detection by reconciling accuracy and efficiency through a biologically inspired, pure-Transformer design. DualGazeNet integrates two principles from human vision (robust representation learning and magnocellular–parvocellular dual-pathway processing) via four modules: HFEM for global feature extraction, MGQM for cortical-attention-guided localization, MGFRM for query-guided feature reconstruction, and SMGM for pixel-level saliency. It trains with a simple Dice plus BCE loss and demonstrates state-of-the-art performance on five RGB-SOD benchmarks, with on average about 60% faster inference and 53.4% fewer FLOPs than four Transformer baselines of similar capacity, plus strong generalization to camouflaged and underwater SOD without additional modalities. The results suggest that biologically grounded simplicity can outperform more complex architectures while improving interpretability, motivating broader exploration of dual-gaze mechanisms in vision tasks and multi-modal extensions.
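The "Dice plus BCE" objective mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the equal weighting of the two terms, the smoothing constant, and the function names `bce_loss`, `dice_loss`, and `dice_bce_loss` are assumptions made for illustration, and the inputs are flattened lists of per-pixel saliency probabilities and binary ground-truth labels.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum(pred) + sum(target) + s).

    The smoothing constant s is an assumed value, not taken from the paper.
    """
    overlap = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def dice_bce_loss(pred, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of the two terms; equal weights are an assumption."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)


# Usage: a 4-pixel mask with an uninformative prediction of 0.5 everywhere.
target = [1.0, 0.0, 1.0, 0.0]
uncertain = [0.5, 0.5, 0.5, 0.5]
loss_value = dice_bce_loss(uncertain, target)
```

The Dice term scores region overlap as a whole, while the BCE term penalizes each pixel independently, so the sum responds to both mask-level and pixel-level errors.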

Abstract

Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately leading to performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models two biological principles of the human visual system: robust representation learning, and magnocellular-parvocellular dual-pathway processing with cortical attention modulation. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60% higher inference speed and 53.4% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.

Paper Structure

This paper contains 20 sections, 10 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Accuracy-efficiency trade-offs of SOTA RGB-SOD methods on DUTS-TE. Circle size indicates inference latency (ms). Results are illustrated separately for (a) lightweight and (b) heavy models to avoid visual clutter and enhance readability.
  • Figure 2: Biological inspiration from human visual perception. Human vision operates through two complementary pathways: the magnocellular pathway rapidly processes coarse spatial information and global scene structure, and the parvocellular pathway refines fine spatial details and precise object boundaries, with cortical attention modulation in the brain dynamically coordinating information flow between these streams.
  • Figure 3: Attention map visualization of DualGazeNet. (a) The first gaze generates dispersed attention patterns to capture global scene context. (b) The second gaze, guided by query tokens, refines attention to focus on salient object regions. This visualization demonstrates how our method progressively refines attention from global scene understanding to salient object localization.
  • Figure 4: Architecture of DualGazeNet. Inspired by the dual pathways in human vision illustrated in Fig. 2, DualGazeNet implements a two-stage processing mechanism: the first "gaze" extracts coarse global structure (magnocellular-like pathway), and the second "gaze" performs fine spatial refinement (parvocellular-like pathway). Query learning mimics cortical attention modulation by selectively enhancing task-relevant features and suppressing background interference. The architecture comprises four key components: (1) HFEM generates multi-level features $F_1$--$F_4$; (2) MGQM forms pseudo-queries $Q_1^{'}$--$Q_4^{'}$ that interact with multi-level features $F_1$--$F_4$ via MHCA to yield cortical queries $Q_1$--$Q_4$; (3) MGRFM refines features $F_1$--$F_4$ with cortical queries $Q_1$--$Q_4$ and obtains enhanced hierarchical features $F^{'}_1$--$F^{'}_4$ through MRCA; (4) SMGM outputs the final prediction.
  • Figure 5: Illustration of the proposed adapter-enhanced Transformer block within the $i$-th stage.
  • ...and 11 more figures
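
The Figure 4 caption describes pseudo-queries interacting with multi-level features to yield cortical queries. The sketch below illustrates that kind of query-to-feature interaction with a generic single-head scaled dot-product cross-attention. It is a simplification under stated assumptions, not the paper's MHCA module: it has no learned projections, no multiple heads, and no residual or normalization layers, and the names `softmax` and `cross_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, features):
    """Each query attends over all feature tokens.

    queries:  list of n_q vectors of length d (stand-ins for pseudo-queries)
    features: list of n_f vectors of length d (stand-ins for one feature level)
    Returns (updated_queries, attention_weights).
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    updated, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every feature token.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        w = softmax(scores)
        weights.append(w)
        # The updated query is the attention-weighted average of the features.
        updated.append([sum(wj * f[k] for wj, f in zip(w, features)) for k in range(d)])
    return updated, weights


# Usage: two feature tokens; one query aligned with the first token,
# and one all-zero query that attends uniformly.
features = [[1.0, 0.0], [0.0, 1.0]]
queries = [[4.0, 0.0], [0.0, 0.0]]
updated, weights = cross_attention(queries, features)
```

A query aligned with a feature token concentrates its attention there, while an uninformative query spreads attention evenly, which loosely mirrors the dispersed first-gaze and focused second-gaze patterns described for Figure 3.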