Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network

Kunpeng Wang, Keke Chen, Chenglong Li, Zhengzheng Tu, Bin Luo

TL;DR

This work tackles alignment-free RGB-T salient object detection by introducing UVT20K, the largest unaligned RGB-T dataset, with 20,000 richly annotated image pairs covering 407 scenes and 1256 object categories. It also presents the Progressive Correlation Network (PCNet), which explicitly aligns common regions via Semantics-guided Homography Estimation (SHE) and progressively models inter- and intra-modal correlations with an Inter- and Intra-Modal Correlation (IIMC) module. On both unaligned and aligned benchmarks, PCNet achieves state-of-the-art results, demonstrating robustness to misalignment and effective multimodal fusion. The dataset and code are publicly available to support research on alignment-free RGB-T SOD in real-world conditions.

Abstract

Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, and complex salient objects. To support further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions on unaligned image pairs. Extensive experiments conducted on unaligned and aligned datasets demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/Angknpng/PCNet.

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of the proposed UVT20K dataset with representative existing RGB-T and RGB-D SOD datasets in terms of scale (circle area), number of scenes (vertical axis), and number of object categories (horizontal axis). The compared datasets are UVT2000 [wang2024alignment], VT5000 [tu2020rgbt], VT1000 [tu2019rgb], VT821 [wang2018rgb], ReDWeb-S [liu2021learning], SIP [fan2020rethinking], and DUT-RGBD [piao2019depth].
  • Figure 2: Main statistics and characteristics of our UVT20K dataset.
  • Figure 3: The overall architecture of our proposed Progressive Correlation Network (PCNet). The framework mainly comprises a Semantics-guided Homography Estimation (SHE) module and an Inter- and Intra-Modal Correlation (IIMC) module. SHE is fine-tuned by the S-Adapter to explicitly align the corresponding regions in visible-thermal image pairs. IIMC first models inter-modal correlations for the aligned regions, and then expands the correlations to the whole RGB modality.
  • Figure 4: The details of the proposed S-Adapter.
  • Figure 5: Examples of before and after warping.
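
The explicit alignment step referred to in Figures 3 and 5 relies on warping one image toward the other with a homography. The sketch below is only an illustration of that general operation, not the authors' SHE module: the function names `warp_points` and `warp_image` are hypothetical, the 3x3 matrix is set by hand rather than predicted by a network, and nearest-neighbour sampling is used for brevity.

```python
import numpy as np


def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of (x, y) points."""
    pts = np.asarray(pts, dtype=np.float64)
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones]) @ H.T  # homogeneous coordinates, shape (N, 3)
    return homog[:, :2] / homog[:, 2:3]   # divide by the third coordinate


def warp_image(img, H, out_shape):
    """Warp a 2D image with homography H (source -> target coordinates).

    Uses inverse mapping with nearest-neighbour sampling. Returns the warped
    image and a boolean mask of target pixels that have a valid source pixel.
    """
    h_out, w_out = out_shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    target_xy = np.stack([xs.ravel(), ys.ravel()], axis=1)

    # Map every target pixel back into the source image.
    source_xy = warp_points(np.linalg.inv(H), target_xy)
    sx = np.rint(source_xy[:, 0]).astype(int)
    sy = np.rint(source_xy[:, 1]).astype(int)

    valid = (sx >= 0) & (sx < img.shape[1]) & (sy >= 0) & (sy < img.shape[0])
    out = np.zeros(h_out * w_out, dtype=img.dtype)
    out[valid] = img[sy[valid], sx[valid]]
    return out.reshape(h_out, w_out), valid.reshape(h_out, w_out)


if __name__ == "__main__":
    # Hand-set example: a pure translation by (+2, +1) pixels (an assumption
    # for illustration; a real homography would come from an estimator).
    H = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])
    thermal = np.arange(20).reshape(4, 5)
    warped, mask = warp_image(thermal, H, thermal.shape)
    print(warped)
    print(mask)
```

The returned mask marks the region shared by both views after warping. In an alignment-free setting, that overlap is where inter-modal correlation can be computed directly, while pixels outside it have no thermal counterpart.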