Fully Exploiting Vision Foundation Model's Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing

Sicen Guo, Tianyou Wen, Chuang-Wei Liu, Qijun Chen, Rui Fan

TL;DR

This study tackles RGB-D driving scene parsing with HFIT (Heterogeneous Feature Integration Transformer), a side-adapter architecture that exploits Vision Foundation Models (VFMs) without re-training them. HFIT combines a frozen ViT with a Duplex Spatial Prior Extractor (DSPE), Recalibrated Heterogeneous Feature Fusion (RHFF) modules, and Holistic Gated Feature Integration (HGFI) modules to fuse VFM priors with heterogeneous RGB-D features across multiple scales. On Cityscapes and KITTI Semantics, the approach reports better generalization and accuracy than traditional RGB-D methods and VFM-based adapters, and the results indicate that relative depth predicted by VFMs can meaningfully improve scene parsing. The work paves the way for robust VFM-based data fusion in driving scene understanding and outlines potential extensions to LiDAR-informed fusion and language-grounded reasoning for autonomous systems.

Abstract

Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at https://mias.group/HFIT.

Paper Structure

This paper contains 15 sections, 15 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An overview of our proposed HFIT. Through the interaction between recalibrated heterogeneous feature fusion (RHFF) modules and holistic gated feature integration (HGFI) modules, heterogeneous features are fully integrated with the profound prior features.
  • Figure 2: An illustration of our proposed HFIT, consisting of (1) a plain ViT, (2) a duplex spatial prior extractor, (3) recalibrated heterogeneous feature fusion modules, and (4) holistic gated feature integration modules.
  • Figure 3: Qualitative comparisons with SoTA scene parsing approaches on the KITTI Semantics dataset [menze2015kitti].
  • Figure 4: Qualitative comparisons with SoTA scene parsing approaches on the Cityscapes dataset [cordts2016cityscapes].
  • Figure 5: Probability maps for different classes generated by VFMs and HFIT, where blue indicates high prediction confidence and red represents low prediction confidence.
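
The captions above describe gated integration of heterogeneous features with prior features from a frozen ViT, but this overview does not give the module equations. As a rough illustration of what "gated" integration generally means, the sketch below blends a prior feature vector with a fused RGB-D feature vector using a per-channel sigmoid gate. This is a generic gating pattern, not the authors' HGFI module: the function name `gated_fuse`, the weight and bias parameters, and the per-channel form of the gate are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(prior, hetero, w_prior, w_hetero, bias):
    """Blend two equal-length feature vectors with a per-channel gate.

    Hypothetical sketch, not the paper's HGFI definition.
    prior  : features from the frozen backbone (assumed input)
    hetero : fused RGB-D features from a side adapter (assumed input)
    w_prior, w_hetero, bias : per-channel gate parameters (assumed learnable)
    """
    lengths = {len(prior), len(hetero), len(w_prior), len(w_hetero), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for p, h, wp, wh, b in zip(prior, hetero, w_prior, w_hetero, bias):
        # The gate g decides, per channel, how much of the prior to keep.
        g = sigmoid(wp * p + wh * h + b)
        fused.append(g * p + (1.0 - g) * h)
    return fused


if __name__ == "__main__":
    prior = [1.0, 2.0]
    hetero = [3.0, 6.0]
    zeros = [0.0, 0.0]
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_fuse(prior, hetero, zeros, zeros, zeros))
```

In this sketch, a strongly positive bias pushes the gate toward 1 and the output toward the prior features, while a strongly negative bias favors the RGB-D features. In a side-adapter setting one would expect only such gate parameters to be trained while the backbone stays frozen, though how the paper parameterizes its gates is not stated in this overview.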