DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection
Yu Zhang, Haoan Ping, Yuchen Li, Zhenshan Bing, Fuchun Sun, Alois Knoll
TL;DR
This work tackles salient object detection (SOD) by reconciling accuracy and efficiency through a biologically inspired, pure-Transformer design. DualGazeNet integrates two principles from human vision, robust representation learning and magnocellular-parvocellular dual-pathway processing, via four modules: HFEM for global feature extraction, MGQM for cortical-attention-guided localization, MGFRM for query-guided feature reconstruction, and SMGM for pixel-level saliency. It trains with a simple Dice plus BCE loss and achieves state-of-the-art performance on five RGB SOD benchmarks. On average, it runs about 60% faster and uses 53.4% fewer FLOPs than four Transformer baselines of similar capacity, and it generalizes well to camouflaged and underwater SOD without additional modalities. The results suggest that biologically grounded simplicity can outperform more complex architectures while improving interpretability, motivating broader exploration of dual-gaze mechanisms in vision tasks and multi-modal extensions.
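The Dice plus BCE objective mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal weighting of the two terms, the smoothing constant, and the use of flat Python lists in place of tensors are all assumptions made for illustration.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over a flattened probability map."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(pred) + sum(target) + smooth)."""
    overlap = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)

def dice_bce_loss(pred, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of the two terms; equal weights are an assumption."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)

# Toy 4-pixel saliency map: a perfect prediction and a fully inverted one.
mask = [1.0, 0.0, 1.0, 0.0]
good = dice_bce_loss([1.0, 0.0, 1.0, 0.0], mask)
bad = dice_bce_loss([0.0, 1.0, 0.0, 1.0], mask)
```

The Dice term measures region overlap and is less sensitive to the imbalance between salient and background pixels, while the BCE term supplies a per-pixel gradient. In this toy case the perfect prediction yields a loss near zero and the inverted one a much larger value.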
Abstract
Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. Paradoxically, this complexity introduces feature redundancy and cross-component interference that obscure salient cues and ultimately lead to performance bottlenecks. In contrast, human vision identifies salient objects efficiently without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models two principles of the human visual system: robust representation learning, and magnocellular-parvocellular dual-pathway processing with cortical attention modulation. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60% higher inference speed and 53.4% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.
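To make the reported efficiency figures concrete, the short sketch below converts them into ratios relative to a baseline. It assumes that "inference speed" means throughput (for example, frames per second) and treats the averaged percentages as if they applied to a single baseline; both are interpretive assumptions, and the helper names are hypothetical.

```python
def relative_latency(speed_gain_pct):
    """Per-image latency relative to the baseline, given a throughput gain in percent."""
    return 1.0 / (1.0 + speed_gain_pct / 100.0)

def relative_flops(flops_reduction_pct):
    """FLOPs relative to the baseline, given a reduction in percent."""
    return 1.0 - flops_reduction_pct / 100.0

latency_ratio = relative_latency(60.0)  # fraction of baseline latency
flops_ratio = relative_flops(53.4)      # fraction of baseline FLOPs
```

Under these assumptions, 60% higher throughput corresponds to roughly 0.625 of the baseline latency per image (about 37.5% less time), and 53.4% fewer FLOPs corresponds to about 0.466 of the baseline compute.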
