A Novel Vision Transformer for Camera-LiDAR Fusion based Traffic Object Segmentation

Toomas Tahves, Junyi Gu, Mauro Bellone, Raivo Sell

TL;DR

This work tackles robust traffic object segmentation for autonomous driving by fusing camera and LiDAR data in a vision-transformer framework. The Camera-LiDAR Fusion Transformer (CLFT) combines embedding, encoder, and decoder stages with cross-fusion to integrate multimodal features, and extends segmentation to cyclists, signs, and pedestrians under diverse weather. Empirical results on the Waymo Open Dataset show that CLFT variants, especially the Hybrid configuration, achieve higher segmentation accuracy and greater resilience than CNN and single-modality Transformer baselines, with notable gains in rain and night scenarios. While promising, the approach requires substantial computing resources and its performance varies under severe conditions, which motivates further optimization and the exploration of additional sensor modalities for practical deployment.

Abstract

This paper presents Camera-LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of vision transformers that exploit the self-attention mechanism, we extend segmentation capabilities with additional classification options to a broader set of object classes, including cyclists, traffic signs, and pedestrians, across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions, which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state of the art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.

Paper Structure (14 sections, 7 equations, 3 figures, 4 tables)

This paper contains 14 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Embedding process for camera and LiDAR data. (a) The original image is resized to a resolution of $384 \times 384$ to standardize the input dimensions. (b) The input image is divided into non-overlapping fixed-size patches of $16 \times 16$ pixels. (c) Patches are flattened into one-dimensional embedded vectors, with an additional positional embedding (colored in orange) added to provide spatial information. (d) The combined patch embeddings are processed through Multilayer Perceptrons (MLPs) with dimensions $E = \bar{D} \times D$, resulting in a matrix that serves as the input for the transformer encoder. The figure depicts the CLFT-Base variant.
  • Figure 2: Encoder process. (a) The output from embedding is normalized and passed through linear layers into the multi-head attention block. (b) The matrix is split into key, query, and value (KQV) matrices, upon which SoftMax and attention operations are performed. The KQV matrices are then reshaped into a single matrix. (c) Finally, linear operations are executed, and the result is processed through the MLP block.
  • Figure 3: Decoder process. (a) The input tensor, representing data, is concatenated with classification tokens. (b) These tokens are then concatenated based on their positional information, yielding an image-like representation. Two convolution operations, along with up-sampling and down-sampling, are applied. (c) Cross-fusion is applied to combine camera and LiDAR data, progressively integrating outputs from residual computation units from previous steps. The final predicted segmentation is computed through deconvolution and up-sampling blocks.
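
The shape arithmetic implied by Figure 1 and the attention step in Figure 2(b) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (patch_grid, patchify, attention) are hypothetical, three input channels are assumed for the camera image, and the attention formula is assumed to be the standard scaled dot-product form, which the caption does not spell out.

```python
import math


def patch_grid(image_size=384, patch_size=16, channels=3):
    """Return (number of patches, flattened patch length) for a square image.

    channels=3 is an assumption (RGB camera input); the caption does not state it.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Patches are returned in row-major order; each one is a flat list of
    patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Assumed form; queries, keys, and values are lists of equal-length vectors.
    """
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output
```

With the defaults, patch_grid() gives 576 patches (a 24 x 24 grid) of 768 values each, which is the token count and raw vector length that the embedding stage would pass on under these assumptions.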