CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection

Jin Lian, Zhongyu Wan, Ming Gao, JunFeng Chen

TL;DR

CFMD addresses two key issues in salient object detection: boundary degradation from upsampling and inefficient multi-scale feature fusion. It introduces CFLMA, a Mamba-based context-aware aggregation module for dynamic cross-layer weighting, and CFLMD, a dynamic upsampling unit that uses content-informed offsets to preserve spatial details during resolution recovery. Together, these modules form a two-stage, architecture-agnostic framework that improves pixel-level accuracy and boundary segmentation, with ablations showing strong gains on challenging datasets and across backbones. The results suggest practical benefits for real-time and robust saliency detection in complex scenes, while future work points to an RGB-D extension, mobile adaptation, and 3D extensions of the long-range dependency modeling.

Abstract

Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection. However, traditional CFPNs still suffer from two core limitations: (1) a computational bottleneck caused by complex feature weighting operations, and (2) degraded boundary accuracy due to feature blurring in the upsampling process. To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations. First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism. This module adaptively adjusts feature importance based on image context, significantly improving both representation efficiency and generalization. Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery. By adjusting the upsampling range dynamically and initializing with a bilinear strategy, the module effectively reduces feature overlap and maintains fine-grained boundary structures. Extensive experiments on three standard benchmarks using three mainstream backbone networks demonstrate that CFMD achieves substantial improvements in pixel-level accuracy and boundary segmentation quality, especially in complex scenes. The results validate the effectiveness of CFMD in jointly enhancing computational efficiency and segmentation performance, highlighting its strong potential in salient object detection tasks.

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The Cross-layer Feature Mamba Aggregation module proposed in this paper introduces a dual dynamic mechanism on top of CFPN. First, the Mamba module replaces the traditional global pooling layer: it uses a state-space model (SSM) to capture long-range feature dependencies and generates a channel attention mask for dynamic calibration. Second, nonlinear convolution and dynamic offset generation are used to build the dynamic upsampling module [tang2025vpnextrethinkingdense, rando2025serpentselectiveresamplingexpressive, qiu2025sparsemambapclscribblesupervisedmedicalimage], which performs fine scanning (step size 1×1) in high-entropy regions (target boundaries) and coarse-grained sampling (step size 2×2) in low-entropy background regions. This design significantly improves the adaptability and semantic alignment accuracy of feature fusion while keeping the architecture lightweight.
  • Figure 2: The feature maps are passed through the global pooling layer and scaled to different sizes, producing maps of sizes d0, d1, d2, and d4. The Mamba module, a key component of the pipeline, then extracts features from these maps. The extracted features are fused with the corresponding weights ($\Psi_0$, $\Psi_1$, $\Psi_2$, $\Psi_3$). The fused features are fed into the output global pooling layer for subsequent processing and the final result.
  • Figure 3: The structure employs the selective state scanning mechanism of the Mamba module: the multilevel feature map is first downscaled and serialized for input to the module. The core state transition equation $x_k = \bar{A}x_{k-1} + \bar{B}u_k$ then performs the sequential feature computation, where the dynamically generated parameter matrices $\bar{A}$ and $\bar{B}$ capture long-range spatial dependencies. Content-aware feature enhancement is realized through the gating mechanism, and the output equation $y_k = \bar{C}x_k$ maps the hidden state to observable features. Finally, the processed sequence is restored to the original spatial dimension $\mathbb{R}^{B \times 256 \times H \times W}$ by the dimension restoration operation and is used as input to the state model at the next step.
  • Figure 4: We recast upsampling as point sampling and implement it with PyTorch built-in functions. Offsets are first generated by a linear projection combined with bilinear interpolation, and bilinear initialization is introduced to improve accuracy. A static range factor $\sigma$, set to 0.25, constrains the offsets. The sampling set converts the information into point sampling units, and GridSampling aggregates these units to reduce the overlap of sampling points. Feature grouping generates the sample points independently for each group, which makes the method more flexible. Finally, we compare “Linear + Pixel Blending” with “Pixel Blending + Linear” to find the better configuration and improve model performance.
  • Figure 5: This figure shows the feature processing flow: the CFMA module processes the input to obtain the global feature $F_{agg}$, which CFMD then passes through multi-scale pooling to obtain multiple feature maps. Each feature map uses DySample to generate offsets and is dynamically upsampled, and the results are finally fused into a multilevel feature pyramid.
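
The recurrence in the Figure 3 caption, $x_k = \bar{A}x_{k-1} + \bar{B}u_k$ with $y_k = \bar{C}x_k$, can be illustrated with a minimal Python sketch. It assumes a one-dimensional scalar input sequence, a diagonal $\bar{A}$, and parameters that stay fixed over the sequence, whereas the caption describes $\bar{A}$ and $\bar{B}$ as dynamically generated. The function name `ssm_scan` and its arguments are hypothetical and not taken from the paper.

```python
def ssm_scan(u, a_bar, b_bar, c_bar):
    """Run x_k = A x_{k-1} + B u_k and y_k = C x_k over a scalar sequence.

    Assumptions (not from the paper): A is diagonal and stored as the list
    a_bar, B and C are vectors (b_bar, c_bar), and all three are constant
    over the sequence. A selective scan would recompute them per step.
    """
    x = [0.0] * len(a_bar)  # hidden state, initialised to zero
    ys = []
    for u_k in u:
        # State update: each state entry decays by a and absorbs b * u_k.
        x = [a * x_i + b * u_k for a, x_i, b in zip(a_bar, x, b_bar)]
        # Readout: project the hidden state onto a scalar output.
        ys.append(sum(c * x_i for c, x_i in zip(c_bar, x)))
    return ys


if __name__ == "__main__":
    # An impulse input decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a_bar=[0.5], b_bar=[1.0], c_bar=[2.0]))
```

Because the state carries information from every earlier step, an input at position 0 still influences the output at later positions, which is the long-range dependency the caption refers to.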
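
The point-sampling view of upsampling in the Figure 4 caption can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: it handles a single-channel map stored as nested lists, replaces PyTorch's grid sampling with a hand-written bilinear lookup, and assumes offsets are given in input-pixel units and scaled by a factor of 0.25, mirroring the value of $\sigma$ in the caption. The names `bilinear_sample` and `dynamic_upsample` are hypothetical.

```python
def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (list of rows) at a fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so border samples replicate the edge value.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bot = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bot * wy


def dynamic_upsample(feat, offsets, scale=2, sigma=0.25):
    """Upsample feat by sampling one point per output pixel.

    offsets[i][j] is a (dy, dx) pair for output pixel (i, j). In a learned
    module these would be predicted from the features; here they are inputs.
    """
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            # Base position: output pixel centre mapped to input coordinates.
            base_y = (i + 0.5) / scale - 0.5
            base_x = (j + 0.5) / scale - 0.5
            dy, dx = offsets[i][j]
            # sigma limits how far the offset can move the sampling point.
            row.append(bilinear_sample(feat, base_y + sigma * dy,
                                       base_x + sigma * dx))
        out.append(row)
    return out


if __name__ == "__main__":
    feat = [[0.0, 4.0]]
    zero = [[(0.0, 0.0)] * 4 for _ in range(2)]
    print(dynamic_upsample(feat, zero))
```

With all offsets at zero the sketch reduces to ordinary bilinear upsampling, which is how it interprets the bilinear initialization mentioned in the caption; nonzero offsets shift individual sampling points away from that starting grid.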