HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion

Jiahang Li, Peng Yun, Yang Xu, Ye Zhang, Mingjian Sun, Qijun Chen, Ilin Alexander, Rui Fan

TL;DR

This work tackles RGB-thermal (RGB-T) scene parsing by addressing a limitation of symmetric encoders: they fail to exploit the differences between the RGB and thermal modalities. It introduces HAPNet, a hybrid, asymmetric architecture that fuses vision foundation model (VFM) features extracted from RGB images with CNN-derived cross-modal spatial priors through four Progressive Heterogeneous Feature Integrators, followed by a Mask Classification Decoder and an auxiliary local-semantics task. Key contributions include the Cross-modal Spatial Prior Descriptor (CSPD), the Progressive Heterogeneous Feature Integrator (PHFI) with Global-Local Context Aggregation (GLCA) and a Complementary Context Generator (CCG), and a composite loss that incorporates auxiliary supervision. HAPNet achieves state-of-the-art (SoTA) results on MFNet, PST900, and KP Day-Night, and generalizes to RGB-HHA data on NYU-Depth V2. The findings demonstrate improved robustness under poor illumination and clutter, with near real-time performance on powerful GPUs, illustrating the potential of VFMs for RGB-X data fusion in practical autonomous systems.

Abstract

Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.
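
The auxiliary task mentioned above is typically trained jointly with the main parsing objective through a weighted sum of losses. The following is a minimal sketch of that general idea only, not the authors' formulation: this summary does not give the actual loss terms or weights, so plain per-pixel cross-entropy stands in for both terms, and the names `composite_loss` and `aux_weight` (with its default of 0.5) are hypothetical.

```python
import math


def cross_entropy(probs, target, eps=1e-12):
    """Negative log-likelihood of the target class for a single pixel."""
    # Clamp to eps so a zero probability does not raise a math domain error.
    return -math.log(max(probs[target], eps))


def mean_cross_entropy(pred, labels):
    """Average cross-entropy over pixels.

    pred:   list of per-pixel class-probability lists
    labels: list of per-pixel integer class indices
    """
    losses = [cross_entropy(p, y) for p, y in zip(pred, labels)]
    return sum(losses) / len(losses)


def composite_loss(main_pred, aux_pred, labels, aux_weight=0.5):
    """Main parsing loss plus a weighted auxiliary loss.

    aux_weight is an illustrative hyperparameter, not a value from the paper.
    """
    main = mean_cross_entropy(main_pred, labels)
    aux = mean_cross_entropy(aux_pred, labels)
    return main + aux_weight * aux


if __name__ == "__main__":
    # Two pixels, two classes; both heads predict the same distributions here.
    labels = [0, 1]
    main_pred = [[0.9, 0.1], [0.2, 0.8]]
    aux_pred = [[0.7, 0.3], [0.4, 0.6]]
    print(composite_loss(main_pred, aux_pred, labels))
```

Setting `aux_weight` to zero recovers the main loss alone, which is a convenient way to check how much the auxiliary supervision contributes in an ablation.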

Paper Structure

This paper contains 33 sections, 11 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: An overview of our proposed HAPNet architecture.
  • Figure 2: Qualitative comparisons with the SoTA RGB-T scene parsing networks on the MFNet test set, where significantly improved regions are shown with red dashed boxes.
  • Figure 3: Qualitative comparisons with the SoTA RGB-T scene parsing networks on the PST900 test set, where significantly improved regions are shown with red dashed boxes.
  • Figure 4: Qualitative comparisons with the SoTA RGB-T scene parsing networks on the KP Day-Night test set, where significantly improved regions are shown with orange dashed boxes.