SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Yan Gong, Xinyu Zhang, Hao Liu, Xinmin Jiang, Zhiwei Li, Xin Gao, Lei Lin, Dafeng Jin, Jun Li, Huaping Liu
TL;DR
SkipcrossNets tackles the challenge of fusing LiDAR and camera data for road detection by introducing an adaptive skip-cross fusion framework that connects each layer of one modality to the layers of the other. By projecting LiDAR point clouds into Altitude Difference Images (ADIs), the method reduces the data-space gap between the two modalities and enables dense cross-modal interactions through learnable fusion weights, improving feature reuse, especially for sparse point clouds. Evaluated on KITTI and A2D2, SkipcrossNets reaches a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2 with a small model footprint (2.33 MB) and real-time speed (68.24 FPS), outperforming several state-of-the-art fusion strategies. The approach is validated through ablations, cross-dataset analyses, and robustness tests across varying sensor densities, which points to its practical potential for autonomous driving on mobile or embedded platforms.
Abstract
Multi-modal fusion is increasingly used for autonomous driving tasks, as different modalities provide unique information for feature extraction. However, existing two-stream networks fuse the modalities only at a specific network layer, and choosing that layer requires many manual trials. As the CNN goes deeper, the features of the two modalities become increasingly high-level and abstract, so fusion takes place between feature levels separated by a large gap, which can easily hurt performance. To reduce the loss of height and depth information when projecting point clouds into 2D space, we use calibration parameters to project the point cloud into Altitude Difference Images (ADIs), which exhibit more distinct road features. In this study, we propose a novel fusion architecture called Skip-cross Networks (SkipcrossNets), which adaptively combines ADIs and camera images without being bound to a fixed fusion stage. Specifically, the skip-cross fusion strategy connects each layer to every other layer in a feed-forward manner: each layer takes the feature maps of all previous layers as input, and its own feature maps are used as input to all subsequent layers of the other modality, enhancing feature propagation and multi-modal feature fusion. This strategy facilitates selection of the most similar feature layers from the two modalities, enhancing feature reuse and providing complementary effects for sparse point cloud features. The advantages of the skip-cross fusion strategy are demonstrated on the KITTI and A2D2 datasets, where it achieves a MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model parameters require only 2.33 MB of memory and the model runs at 68.24 FPS, making it viable for mobile terminals and embedded devices.
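The skip-cross connectivity described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: feature maps are reduced to flat lists of floats, `layer` is a hypothetical stand-in for a convolutional block, and the choices to normalize the learnable fusion weights with a softmax and to merge features by addition rather than concatenation are assumptions made only to keep the example short. All function names are illustrative.

```python
import math


def softmax(weights):
    """Normalize raw fusion weights so they are positive and sum to 1."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def layer(x, scale=0.5):
    """Hypothetical stand-in for a conv block: elementwise tanh of a scaled input."""
    return [math.tanh(scale * v) for v in x]


def fuse(own, others, weights):
    """Add a weighted sum of the other modality's earlier features to `own`.

    Assumption: addition with softmax-normalized weights; the paper only
    states that the fusion weights are learnable.
    """
    out = list(own)
    for w, feat in zip(softmax(weights), others):
        out = [o + w * f for o, f in zip(out, feat)]
    return out


def make_weights(depth):
    """One raw weight per cross connection: layer l sees l + 1 earlier features."""
    return [[0.0] * (l + 1) for l in range(depth)]


def skip_cross_forward(img, adi, w_img, w_adi, depth=3):
    """Run two streams where every layer also receives all earlier
    feature maps of the other modality."""
    img_feats, adi_feats = [img], [adi]
    for l in range(depth):
        # Build both inputs before appending, so each stream only sees
        # features the other stream produced at earlier layers.
        img_in = fuse(img_feats[-1], adi_feats, w_img[l])
        adi_in = fuse(adi_feats[-1], img_feats, w_adi[l])
        img_feats.append(layer(img_in))
        adi_feats.append(layer(adi_in))
    return img_feats, adi_feats


if __name__ == "__main__":
    depth = 3
    img_feats, adi_feats = skip_cross_forward(
        [0.2, -0.1, 0.4], [1.0, 0.0, -1.0],
        make_weights(depth), make_weights(depth), depth,
    )
    print(len(img_feats), len(adi_feats))
```

In this sketch the weights in `w_img` and `w_adi` would be trained with the rest of the network, so a stream could learn to emphasize whichever earlier layer of the other modality is most useful, rather than relying on a single hand-picked fusion layer.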
