RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

Jianxin Huang; Jiahang Li; Ning Jia; Yuxiang Sun; Chengju Liu; Qijun Chen; Rui Fan

TL;DR

RoadFormer+ tackles universal RGB-X scene parsing with a Hybrid Feature Decoupling Encoder (HFDE), which uses a weight-sharing backbone followed by independent global and local feature streams, and a dual-branch Multi-Scale Heterogeneous Feature Fusion (MHFF) block, which fuses global and local cues via Transformer and CNN pathways. Within the MHFF block, the Global Feature Recalibration Module, Local Feature Fusion Module, and Feature Enhancement and Integration Module recalibrate, fuse, and spatially refine heterogeneous features for robust semantic predictions. Empirically, RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets, while using about 65% fewer learnable parameters than RoadFormer. It performs strongly across RGB-Normal, RGB-Thermal, and RGB-Polarization modalities. The approach targets robust, multi-sensor urban scene understanding, and the authors state that the source code will be made publicly available.

Abstract

Task-specific data-fusion networks have marked considerable achievements in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance deteriorates significantly when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where "X" represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65% compared to RoadFormer. Our source code will be publicly available at mias.group/RoadFormerPlus.
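For readers unfamiliar with the metric named above, mean intersection over union (mIoU) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_iou`, the flattened-label input format, and the `ignore_index` convention are assumptions made here for illustration, and benchmark suites typically accumulate the counts over an entire dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class labels in [0, num_classes).
    ignore_index: optional ground-truth label to skip (hypothetical convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from every class
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the predicted class
            # and of the ground-truth class, but no intersection.
            union[p] += 1
            union[t] += 1
    # Per-class IoU, skipping classes absent from both pred and target.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Classes that appear in neither the prediction nor the ground truth are left out of the average, which is one common convention among several.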

Paper Structure (33 sections, 14 equations, 5 figures, 11 tables)


INTRODUCTION
Background
Existing Challenges and Motivation
Novel Contributions
Article Structure
RELATED WORK
Single-Modal Scene Parsing
Data-Fusion Scene Parsing
METHODOLOGY
Hybrid Feature Decoupling Encoder
Overall Feature Encoding Pipeline
Weight-Sharing Backbone
Global Feature Enhancer
Local Feature Extractor
Multi-Scale Heterogeneous Feature Fusion Block
...and 18 more sections

Figures (5)

Figure 1: An overview of our proposed RoadFormer+ architecture.
Figure 2: An illustration of our proposed multi-scale heterogeneous feature fusion block, consisting of (a) a global feature recalibration module, (b) a local feature fusion module, and (c) a feature enhancement and integration module.
Figure 3: Qualitative comparisons between our proposed RoadFormer+ and other SoTA networks on the Cityscapes validation set, where significantly improved regions are shown with yellow dashed boxes.
Figure 4: Qualitative comparison between our proposed RoadFormer+ and other SoTA networks on the KITTI Road dataset. The results are produced by the official KITTI online benchmark suite. The classifications are visualized with true positives in green, false positives in blue, and false negatives in red.
Figure 5: Qualitative comparisons between our proposed RoadFormer+ and other SoTA networks on the MFNet test set, with significantly improved regions highlighted in red dashed boxes.
