Efficient Fourier Filtering Network with Contrastive Learning for AAV-based Unaligned Bimodal Salient Object Detection

Pengfei Lyu, Pak-Hei Yeung, Xiaosheng Yu, Xiufei Cheng, Chengdong Wu, Jagath C. Rajapakse

TL;DR

The paper tackles real-time, accurate bimodal salient object detection (BSOD) on autonomous aerial vehicles (AAVs) when the RGB and thermal modalities are unaligned. It introduces AlignSal, a lightweight framework that combines an FFT-based encoder with two novel modules: a semantic contrastive alignment loss (SCAL) for semantic-level cross-modal alignment, and synchronized alignment fusion (SAF) for pixel-level alignment and fusion through FFT-based global filtering. On the AAV RGB-T 2400 dataset and seven bimodal dense prediction datasets, AlignSal runs in real time while outperforming nineteen state-of-the-art models on most evaluation metrics. Compared with the leading BSOD model, MROS, it uses 70.0% fewer parameters and 49.4% fewer floating point operations, and its inference speed is 152.5% higher. Ablation studies confirm the roles of SCAL and SAF, and generalization tests on weakly aligned, aligned, and remote sensing datasets support AlignSal's practicality for real-world AAV deployment.

Abstract

Autonomous aerial vehicle (AAV)-based bi-modal salient object detection (BSOD) aims to segment salient objects in a scene utilizing complementary cues in unaligned RGB and thermal image pairs. However, the high computational expense of existing AAV-based BSOD models limits their applicability to real-world AAV devices. To address this problem, we propose an efficient Fourier filter network with contrastive learning that achieves both real-time and accurate performance. Specifically, we first design a semantic contrastive alignment loss to align the two modalities at the semantic level, which facilitates mutual refinement in a parameter-free way. Second, inspired by the fast Fourier transform that obtains global relevance in linear complexity, we propose synchronized alignment fusion, which aligns and fuses bi-modal features in the channel and spatial dimensions by a hierarchical filtering mechanism. Our proposed model, AlignSal, reduces the number of parameters by 70.0%, decreases the floating point operations by 49.4%, and increases the inference speed by 152.5% compared to the cutting-edge BSOD model (i.e., MROS). Extensive experiments on the AAV RGB-T 2400 and seven bi-modal dense prediction datasets demonstrate that AlignSal achieves both real-time inference speed and better performance and generalizability compared to nineteen state-of-the-art models across most evaluation metrics. In addition, our ablation studies further verify AlignSal's potential in boosting the performance of existing aligned BSOD models on AAV-based unaligned data. The code is available at: https://github.com/JoshuaLPF/AlignSal.
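To make the idea of a parameter-free contrastive alignment loss more concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss between RGB and thermal semantic embeddings. It is an illustration under stated assumptions, not the authors' SCAL formulation: the function names, the `temperature` value, and the use of one pooled vector per image are all hypothetical choices. The loss has no learnable weights, which matches the "parameter-free" description in the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(rgb_feats, thermal_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (hypothetical stand-in, not the paper's SCAL).

    rgb_feats[i] and thermal_feats[i] are assumed to come from the same scene,
    so they form the positive pair; all other pairings in the batch are negatives.
    """
    rgb = [l2_normalize(v) for v in rgb_feats]
    thermal = [l2_normalize(v) for v in thermal_feats]
    # logits[i][j] = cosine similarity between RGB i and thermal j, over temperature
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in thermal]
        for r in rgb
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the RGB-to-thermal and thermal-to-RGB directions
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


# Toy batch: two scenes with 2-D embeddings (made-up numbers for illustration)
rgb_batch = [[1.0, 0.0], [0.0, 1.0]]
thermal_matched = [[0.9, 0.1], [0.1, 0.9]]
thermal_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = contrastive_alignment_loss(rgb_batch, thermal_matched)
loss_swapped = contrastive_alignment_loss(rgb_batch, thermal_swapped)
```

In this toy batch, `loss_matched` is smaller than `loss_swapped`, because the loss rewards embeddings of the same scene for being closer to each other than to embeddings of other scenes.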

Paper Structure

This paper contains 18 sections, 12 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Examples from the AAV RGB-T 2400 dataset 10315195. RGB images typically have higher resolution and a wider field of view compared to thermal images.
  • Figure 2: The overall framework of the proposed AlignSal. The RGB and thermal images are first fed to a dual-stream encoder, which extracts initial bimodal features $\left\{ {f_i^r} \right\}_{i = 1}^{4}$ and $\left\{ {f_i^t} \right\}_{i = 1}^{4}$. During training, the semantic contrastive alignment loss (SCAL) facilitates the exchange of information between the RGB semantic feature $f_4^r$ and the thermal semantic feature $f_4^t$, essentially achieving bimodal alignment. The bimodal features are then aligned and fused at the pixel level by the synchronized alignment fusion (SAF). Finally, a simple decoder integrates the fused features $\left\{ {f_i^s} \right\}_{i = 1}^{4}$ from high to low levels, generating the decoder features $\left\{ {f_i^d} \right\}_{i = 1}^{4}$ and the final prediction $\mathcal{S}$.
  • Figure 3: Visual comparison of our AlignSal with state-of-the-art BSOD models under several challenges: fast AAV movement ($1^{st}$, $2^{nd}$, and $6^{th}$ rows) and fast object movement ($8^{th}$ row) in object blurring scenes; low illumination ($1^{st}$ and $8^{th}$ rows), street light exposure ($3^{rd}$ and $9^{th}$ rows), and extreme low illumination ($4^{th}$ row) in illumination changing scenes; small objects ($2^{nd}$, $4^{th}$, $5^{th}$, and $8^{th}$ rows), out-of-view ($1^{st}$, $2^{nd}$, $6^{th}$, and $8^{th}$ rows), multiple objects ($1^{st}$-$4^{th}$ and $7^{th}$-$9^{th}$ rows), scale variation ($8^{th}$ row), and center bias ($4^{th}$ and $6^{th}$ rows) in object changing scenes; and rain ($8^{th}$ row) and snow ($9^{th}$ row) in weather changing scenes. GT represents ground truth.
  • Figure 4: (a) Precision-recall (PR) and (b) F-measure-threshold (FT) curves of BSOD models on the AAV RGB-T 2400 dataset.
  • Figure 5: Visualization of the feature maps generated by models with and without SCAL and SAF. L1 to L4 represent the levels from low to high. (a) displays RGB and thermal images. (b) and (d) present the RGB and thermal feature maps, while (c) and (e) show the RGB and thermal feature maps generated by the model without SCAL. (f) shows the feature maps after SAF. (g) illustrates the fused feature maps from the model without SAF. (h) shows the feature maps after the decoder layers.
  • ...and 3 more figures
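
The FFT-based global filtering that SAF builds on (Figure 2) can be illustrated with a small, self-contained sketch. It is a 1-D toy under stated assumptions, not the authors' SAF module: the paper filters 2-D feature maps along channel and spatial dimensions, whereas the hypothetical `global_filter` below works on a single 1-D signal and uses a hand-written radix-2 FFT so that it needs only the standard library. It shows the underlying principle: multiplying the spectrum element-wise by a set of weights is equivalent to a circular convolution whose kernel spans the whole signal, so every output position can depend on every input position. The transform itself costs O(N log N) for N positions.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def ifft(spectrum):
    """Inverse FFT with the 1/n normalization applied."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def global_filter(signal, freq_weights):
    """Filter a real signal by element-wise multiplication in the frequency domain.

    Taking .real at the end is exact only when freq_weights is conjugate-symmetric,
    as it is when the weights are the FFT of a real-valued kernel.
    """
    spectrum = fft([complex(v) for v in signal])
    filtered = [s * w for s, w in zip(spectrum, freq_weights)]
    return [v.real for v in ifft(filtered)]


def circular_conv(signal, kernel):
    """Direct O(n^2) circular convolution, used only as a reference."""
    n = len(signal)
    return [
        sum(signal[(i - j) % n] * kernel[j] for j in range(n)) for i in range(n)
    ]


# Made-up example data: a length-8 signal and a smoothing kernel that wraps around
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

filtered = global_filter(signal, fft(kernel))
reference = circular_conv(signal, kernel)
max_error = max(abs(a - b) for a, b in zip(filtered, reference))
```

Here `max_error` is at the level of floating-point round-off, which confirms that the frequency-domain product and the spatial circular convolution agree. In a learned filter the weights would be trainable parameters rather than the transform of a fixed kernel.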