Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection

Pengfei Lyu, Xiaosheng Yu, Pak-Hei Yeung, Chengdong Wu, Jagath C. Rajapakse

TL;DR

RGB-T salient object detection faces challenges from variable lighting and high-resolution bimodal fusion, where Transformer-based approaches incur large memory costs. FreqSal introduces a purely FFT-based architecture that fuses RGB and thermal information in the frequency domain with Modal-coordinated Perception Attention (MPA), clarifies edges with the Frequency-decomposed Edge-aware Block (FEB), and decodes with the Fourier Residual Channel Attention Block (FRCAB), which prioritizes high-frequency information. Training is guided by the Co-focus Frequency Loss (CFL). The approach reports state-of-the-art results across ten RGB-T benchmarks and generalizes to RGB-D-T and RGB-D SOD while supporting input resolutions up to $512 \times 512$. This work demonstrates the potential of Fourier-domain learning for dense prediction tasks and provides a scalable alternative to Transformer-heavy methods.

Abstract

The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: (1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; (2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; (3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models' predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
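To make the idea of dynamically weighting hard frequencies concrete, the following is a minimal NumPy sketch of a focal-style frequency loss. It is not the paper's Co-focus Frequency Loss: the function name, the `alpha` exponent, and the max-normalized weighting are assumptions made for illustration, and the cross-referencing of bimodal edge information described in the abstract is omitted.

```python
import numpy as np


def focal_style_frequency_loss(pred, target, alpha=1.0):
    """Frequency-domain loss that up-weights poorly reconstructed frequencies.

    pred, target: 2-D float arrays of the same shape (e.g. edge maps in [0, 1]).
    alpha: hypothetical exponent controlling how strongly hard frequencies
           are emphasized (alpha=0 disables the weighting).
    """
    # Orthonormal 2-D FFT keeps spectral energy equal to spatial energy.
    f_pred = np.fft.fft2(pred, norm="ortho")
    f_target = np.fft.fft2(target, norm="ortho")

    # Per-frequency squared error between the two spectra.
    sq_err = np.abs(f_pred - f_target) ** 2

    # Weights grow with the current error, then are scaled into [0, 1].
    weight = np.sqrt(sq_err) ** alpha
    max_w = weight.max()
    if max_w > 0:
        weight = weight / max_w

    return float(np.mean(weight * sq_err))


# Illustrative usage on synthetic data (assumed, not from the paper).
rng = np.random.default_rng(0)
target = (rng.random((8, 8)) > 0.5).astype(float)
pred = np.clip(target + 0.1 * rng.standard_normal((8, 8)), 0.0, 1.0)
print(focal_style_frequency_loss(pred, target))
```

With `alpha=0` the weights are all one and, by Parseval's theorem for the orthonormal FFT, the loss reduces to the ordinary spatial-domain mean squared error; larger `alpha` shifts the emphasis toward the frequencies that are currently reconstructed worst.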

Paper Structure

This paper contains 24 sections, 42 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Frequency spectrum visualization: (a) RGB and (b) thermal images, with (c) and (d) showing their corresponding frequency spectrograms.
  • Figure 2: The framework of our proposed FreqSal. A dual-stream encoder first extracts initial features $\left\{ {{r_i}} \right\}_{i = 1}^4$ and $\left\{ {{t_i}} \right\}_{i = 1}^4$ from the RGB and thermal images. Next, the bimodal features enter the Modal-coordinated Perception Attention (MPA), which bridges their complementary information and fuses them. During decoding, the fused features $\left\{ {{f_i}} \right\}_{i = 1}^4$ are integrated from low to high resolution by the Fourier Residual Channel Attention Block (FRCAB) to obtain decoder features $\left\{ {{d_i}} \right\}_{i = 1}^4$. Meanwhile, the shallow features $\left\{ {{t_i, f_i}} \right\}_{i = 1}^2$ are fed into the Frequency-decomposed Edge-aware Block (FEB) to obtain refined edge features $\left\{ {{e_i}} \right\}_{i = 1}^3$, which guide the decoder in generating an accurate saliency map $\mathcal{S}$. In the training stage, the Co-focus Frequency Loss (CFL) $\mathcal{L}_{CFL}$, which favors the synthesis of difficult frequencies, is used together with spatial-domain losses to supervise the generation of a high-quality edge map $\mathcal{E}$.
  • Figure 3: Architectures of the Modal-coordinated Perception Attention (MPA) and its core component Modal-coordinated Perception Filter (MPF). The RGB feature $r_i$ and thermal feature $t_i$ serve as inputs to the MPA, producing the output $f_i$, while $\widetilde{f_i}$ denotes the output of the MPF.
  • Figure 4: Visualization of the amplitude, phase, high-frequency, and low-frequency components of RGB and thermal images.
  • Figure 5: Architecture of the Edge Frequency Extraction Block (EFEB). It enhances the low-level thermal feature $t_i$ and fusion feature $f_i$ through the Phase Enhancement Process (PEP) and an adaptive high-pass filter, thereby obtaining clear edge features $e_i$.
  • ...and 11 more figures
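
The components visualized in Figure 4 (amplitude, phase, low-frequency, and high-frequency) can be approximated with a few FFT calls. The sketch below is an illustration under stated assumptions, not the authors' visualization code: the function name `fourier_components`, the ideal circular mask, and the `radius` cutoff are hypothetical choices.

```python
import numpy as np


def fourier_components(image, radius=4):
    """Split a 2-D grayscale image into amplitude, phase, low- and high-frequency parts.

    radius: hypothetical cutoff (in frequency bins) of an ideal circular low-pass mask.
    """
    # Shift the zero-frequency bin to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Distance of every bin from the centered zero-frequency bin.
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Complementary masks, so low + high reconstructs the input image.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return amplitude, phase, low, high


# Illustrative usage on a synthetic gradient image (assumed input).
image = np.arange(64, dtype=float).reshape(8, 8)
amplitude, phase, low, high = fourier_components(image, radius=2)
print(np.allclose(low + high, image))
```

Because the two masks are complementary, the low- and high-frequency images always sum back to the input; a smooth mask (for example a Gaussian) would reduce ringing at the cost of a less sharp separation.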