SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection

Yan Gong, Xinyu Zhang, Hao Liu, Xinmin Jiang, Zhiwei Li, Xin Gao, Lei Lin, Dafeng Jin, Jun Li, Huaping Liu

TL;DR

SkipcrossNets tackles the challenge of fusing LiDAR and camera data for road detection by introducing an adaptive skip-cross fusion framework that connects every layer to every other layer across modalities. By projecting LiDAR into Altitude Difference Images (ADIs), the method reduces the data-space gap and enables dense cross-modal interactions through learnable fusion weights, improving feature reuse, especially for sparse point clouds. Evaluated on KITTI and A2D2, SkipcrossNets delivers competitive MaxF and F1 scores with a remarkably small model footprint (2.33 MB) and real-time speed (68.24 FPS), outperforming several state-of-the-art fusion strategies. The approach is validated through extensive ablations, cross-dataset analyses, and demonstrations of robustness across varying sensor densities, highlighting its practical potential for autonomous driving on mobile or embedded platforms.

Abstract

Multi-modal fusion is increasingly used for autonomous driving tasks, as different modalities provide unique information for feature extraction. However, existing two-stream networks fuse the modalities only at a specific network layer, and choosing that layer requires many manual attempts. As the CNN goes deeper, the features of the two modalities become increasingly high-level and abstract, so fusion takes place between features separated by a large gap, which can easily hurt performance. To reduce the loss of height and depth information when projecting point clouds into 2D space, we use calibration parameters to project the point cloud into Altitude Difference Images (ADIs), which exhibit more distinct road features. In this study, we propose a novel fusion architecture called Skip-cross Networks (SkipcrossNets), which adaptively combines ADIs and camera images without being bound to a particular fusion stage. Specifically, the skip-cross fusion strategy connects each layer to every other layer in a feed-forward manner: each layer takes the feature maps of all previous layers as input, and its own feature maps are used as input to all subsequent layers of the other modality, enhancing feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from the two modalities, enhancing feature reuse and providing complementary effects for sparse point cloud features. The advantages of the skip-cross fusion strategy are demonstrated on the KITTI and A2D2 datasets, where it achieves a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory, and the model runs at 68.24 FPS, making it viable for mobile terminals and embedded devices.
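
The dense cross-modal wiring described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`layer`, `skip_cross_block`), the scalar fusion weights `alpha` and `beta`, the use of addition to merge features, and the 1x1 channel mixing that stands in for a convolution are all assumptions made for illustration. In a real network the fusion weights would be learned; here they are fixed constants.

```python
import numpy as np

def layer(x, w):
    """Stand-in for a conv layer (assumption): 1x1 channel mixing, then ReLU.

    x has shape (C, H, W); w has shape (C, C).
    """
    return np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)

def skip_cross_block(img, adi, img_ws, adi_ws, alpha, beta):
    """Toy skip-cross block over two streams (camera image and ADI).

    alpha[l, k] scales ADI feature k before it enters image layer l (k <= l).
    beta[l, k] scales image feature k before it enters ADI layer l (k <= l).
    Both are hypothetical scalar fusion weights.
    """
    img_feats, adi_feats = [img], [adi]
    for l in range(len(img_ws)):
        # Each layer receives ALL earlier feature maps of the other modality.
        cross_to_img = sum(alpha[l, k] * adi_feats[k] for k in range(l + 1))
        cross_to_adi = sum(beta[l, k] * img_feats[k] for k in range(l + 1))
        new_img = layer(img_feats[-1] + cross_to_img, img_ws[l])
        new_adi = layer(adi_feats[-1] + cross_to_adi, adi_ws[l])
        img_feats.append(new_img)
        adi_feats.append(new_adi)
    return img_feats[-1], adi_feats[-1]

rng = np.random.default_rng(0)
channels, height, width, n_layers = 4, 8, 8, 3
img = rng.standard_normal((channels, height, width))
adi = rng.standard_normal((channels, height, width))

# Near-identity weights keep the toy activations from dying under ReLU.
img_ws = [np.eye(channels) + 0.1 * rng.standard_normal((channels, channels))
          for _ in range(n_layers)]
adi_ws = [np.eye(channels) + 0.1 * rng.standard_normal((channels, channels))
          for _ in range(n_layers)]

# Lower-triangular weights: layer l only sees features 0..l of the other stream.
alpha = np.tril(np.full((n_layers, n_layers), 0.1))
beta = np.tril(np.full((n_layers, n_layers), 0.1))

img_out, adi_out = skip_cross_block(img, adi, img_ws, adi_ws, alpha, beta)
print(img_out.shape, adi_out.shape)  # (4, 8, 8) (4, 8, 8)
```

Setting `alpha` and `beta` to zero reduces the sketch to two independent streams, which is one way to see that the fusion weights decide how much each cross-modal connection contributes at each depth.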

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of model parameters and performance for various methods on KITTI and A2D2 scenarios. The red marker denotes our proposed method.
  • Figure 2: Results of state-of-the-art models and ours. The first two rows are tested on the KITTI dataset and the remaining rows on the A2D2 dataset. In sub-pictures (b) and (d), the results differ noticeably from the ground truth. Sub-picture (c) shows more noise in the second picture. Overall, performance is better on the third row, but at object edges, such as the cars in the third row, the results of SkipcrossNets are smoother.
  • Figure 3: The image pairs of RGB and Altitude Difference Image (ADI).
  • Figure 4: The overall architecture of SkipcrossNets. The encoder includes three fusion stages, each of which adopts the skip-cross fusion strategy detailed in the section on skip-cross fusion. Skip-cross fusion directly integrates information from feature extraction and processes it by training cross-connected branches, so fusion can occur at any depth rather than being limited to a single level, as in the other fusion strategies shown in Fig. 5.
  • Figure 5: (a) Early fusion (b) Middle fusion (c) Late fusion. The three fusion strategies differ mainly in the stage at which fusion occurs. When fusion occurs deeper in the network, the features participating in the fusion are more abstract and the model is more flexible. However, effective intermediate and detailed features may also be lost.
  • ...and 1 more figure
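
The early, middle, and late fusion placements named in the Figure 5 caption can be contrasted with a small sketch in which the two streams are merged exactly once, at a chosen depth. The function `fuse_once`, the use of addition as the fusion operator, and the matrix-multiply-plus-ReLU stand-in for a network layer are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def layer(x, w):
    """Stand-in for a network layer (assumption): linear channel mixing + ReLU.

    x has shape (C, N) with N flattened pixels; w has shape (C, C).
    """
    return np.maximum(w @ x, 0.0)

def fuse_once(img, adi, img_ws, adi_ws, shared_ws, fuse_at):
    """Fuse two streams by addition exactly once, after `fuse_at` layers.

    fuse_at == 0               -> early fusion (inputs are merged)
    0 < fuse_at < len(ws)      -> middle fusion
    fuse_at == len(shared_ws)  -> late fusion (only the outputs are merged)
    """
    # Modality-specific layers before the fusion point.
    for l in range(fuse_at):
        img = layer(img, img_ws[l])
        adi = layer(adi, adi_ws[l])
    fused = img + adi
    # Shared layers after the fusion point.
    for l in range(fuse_at, len(shared_ws)):
        fused = layer(fused, shared_ws[l])
    return fused

rng = np.random.default_rng(1)
channels, pixels, n_layers = 4, 16, 3
img = rng.standard_normal((channels, pixels))
adi = rng.standard_normal((channels, pixels))
img_ws = [rng.standard_normal((channels, channels)) for _ in range(n_layers)]
adi_ws = [rng.standard_normal((channels, channels)) for _ in range(n_layers)]
shared_ws = [rng.standard_normal((channels, channels)) for _ in range(n_layers)]

early = fuse_once(img, adi, img_ws, adi_ws, shared_ws, fuse_at=0)
middle = fuse_once(img, adi, img_ws, adi_ws, shared_ws, fuse_at=1)
late = fuse_once(img, adi, img_ws, adi_ws, shared_ws, fuse_at=n_layers)
print(early.shape, middle.shape, late.shape)  # (4, 16) (4, 16) (4, 16)
```

In this sketch `fuse_at` is a hand-picked hyperparameter, which mirrors the manual choice of fusion layer that the abstract identifies as a drawback of single-point fusion.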