
CLFT: Camera-LiDAR Fusion Transformer for Semantic Segmentation in Autonomous Driving

Junyi Gu, Mauro Bellone, Tomáš Pivoňka, Raivo Sell

TL;DR

This work introduces CLFT, a vision-transformer-based camera-LiDAR fusion framework for semantic segmentation in autonomous driving. It employs a progressive-assemble, double-direction encoder-decoder architecture with cross-fusion across decoder layers to fuse camera and LiDAR representations, using LiDAR projections onto the $XY$, $YZ$, and $XZ$ planes. Evaluated on the Waymo Open Dataset with illumination/weather splits, CLFT-Hybrid achieves higher IoU than FCN-based fusion and Panoptic SegFormer, particularly for underrepresented human classes and in adverse conditions, at a modest cost in inference time. The approach advances multimodal transformer-based perception and demonstrates practical robustness for real-world driving scenarios, with potential extensions to broader autonomy stacks.

Abstract

Research on camera-and-LiDAR-based semantic object segmentation for autonomous driving has benefited significantly from recent developments in deep learning. In particular, the vision transformer successfully brought the multi-head-attention mechanism to computer vision applications. We therefore propose a vision-transformer-based network that carries out camera-LiDAR fusion for semantic segmentation in autonomous driving. Our proposal applies the progressive-assemble strategy of vision transformers to a double-direction network and then integrates the results with a cross-fusion strategy over the transformer decoder layers. Unlike other works in the literature, our camera-LiDAR fusion transformers have been evaluated in challenging conditions such as rain and low illumination, where they show robust performance. The paper reports segmentation results for the vehicle and human classes in three modalities: camera-only, LiDAR-only, and camera-LiDAR fusion. We perform controlled benchmark experiments of CLFT against other networks that are also designed for semantic segmentation. The experiments evaluate the performance of CLFT independently from two perspectives: multimodal sensor fusion and backbone architecture. The quantitative assessments show that our CLFT networks yield an improvement of up to 10% in challenging dark-wet conditions compared with a Fully-Convolutional-Neural-Network-based (FCN) camera-LiDAR fusion network. Compared with a network that has a transformer backbone but uses single-modality input, the all-around improvement is 5-10%.
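The TL;DR reports results as IoU (intersection over union) per class. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `class_iou`, the label IDs (0 = background, 1 = vehicle, 2 = human), and the choice to return NaN for a class absent from both masks are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def class_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    class_id: int,
) -> float:
    """IoU for one class: |pred AND target| / |pred OR target| over all pixels."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred = p == class_id
            in_target = t == class_id
            if in_pred and in_target:
                intersection += 1
            if in_pred or in_target:
                union += 1
    # Assumed convention: a class that appears in neither mask has no defined IoU.
    return intersection / union if union else float("nan")


# Hypothetical 2 x 4 label masks (0 = background, 1 = vehicle, 2 = human).
pred = [[0, 1, 1, 1],
        [0, 2, 0, 0]]
target = [[0, 1, 1, 0],
          [0, 2, 2, 0]]

vehicle_iou = class_iou(pred, target, class_id=1)  # 2 shared pixels, 3 in the union
human_iou = class_iou(pred, target, class_id=2)    # 1 shared pixel, 2 in the union
```

Because the union only counts pixels where the class appears in either mask, a class that covers few pixels (such as humans in driving scenes) is scored on its own terms rather than being swamped by background pixels.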

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, 5 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overall architecture of our double-direction network. Camera data flows into the ViT encoder from the left side, and LiDAR data flows in from the right. The camera input consists of the individual RGB channels, and the LiDAR input consists of the XY, YZ, and XZ projection planes. The cross-fusion strategy is shown in the center, highlighted by a dashed rectangle.
  • Figure 2: Assemble architecture for each transformer decoder block. The tokens of each layer are assembled into image-like feature-map representations.
  • Figure 3: Each fusion block receives data from the previous stage and integrates camera-LiDAR data coming from the ViT encoder. Each of these blocks has residual units, de-convolution, and up-sampling.
  • Figure 4: Examples of a camera image, a semantic annotation mask, and the pre-processing of LiDAR data. (a) is the RGB image. (b) illustrates the object semantic masks obtained from LiDAR ground-truth bounding boxes. (c), (e), and (g) are LiDAR projection images in the X, Y, and Z channels, respectively, while (d), (f), and (h) are the corresponding up-sampled dense images. Note that, for visualization purposes, the grayscale intensity in (c)-(h) is scaled proportionally to the numerical 3D coordinate values of the LiDAR points.
  • Figure 5: Qualitative comparison of segmentation results between different models.
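
The sparse LiDAR projection images described in the Figure 4 caption, panels (c), (e), and (g), can be pictured with a short sketch. It assumes a simple pinhole camera model and points already expressed in the camera frame with z pointing forward. The function name `project_points_to_channels`, the intrinsics `fx`, `fy`, `cx`, `cy`, the use of 0.0 to mark empty pixels, and the rule of keeping the nearest point per pixel are all illustrative assumptions rather than details taken from the paper. The up-sampling step of panels (d), (f), and (h) is not covered.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]


def project_points_to_channels(
    points: Sequence[Tuple[float, float, float]],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Tuple[Image, Image, Image]:
    """Project 3D points onto the image plane and store X, Y, Z per pixel."""
    x_img = [[0.0] * width for _ in range(height)]
    y_img = [[0.0] * width for _ in range(height)]
    z_img = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0.0:
            continue  # point is behind the camera
        # Pinhole projection to pixel coordinates (assumed camera model).
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # point falls outside the image
        # Assumed tie-break: keep the point closest to the camera.
        if z_img[v][u] == 0.0 or z < z_img[v][u]:
            x_img[v][u], y_img[v][u], z_img[v][u] = x, y, z
    return x_img, y_img, z_img


# Hypothetical points in the camera frame (meters) and a tiny 4 x 3 image.
points = [
    (0.0, 0.0, 10.0),    # straight ahead, lands on pixel (u=2, v=1)
    (-0.75, 0.25, 5.0),  # lands on pixel (u=0, v=1)
    (1.0, 1.0, -2.0),    # behind the camera, dropped
    (5.0, 0.0, 1.0),     # projects outside the image, dropped
]
x_img, y_img, z_img = project_points_to_channels(
    points, fx=10.0, fy=10.0, cx=2.0, cy=1.0, width=4, height=3
)
```

Most pixels stay empty after this step, which is why the caption pairs each sparse channel with an up-sampled dense counterpart.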