MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving

Yiqun Duan, Xianda Guo, Zheng Zhu, Zhen Wang, Yu-Kai Wang, Chin-Teng Lin

TL;DR

MaskFuser addresses the limitations of independent modality branches in end-to-end autonomous driving by introducing a unified semantic token space and cross-modality masked auto-encoder training. It combines a hybrid fusion network that uses Monotonic-to-BEV Translation (MBT) attention for early fusion with a shared transformer encoder for late fusion, enabling deep cross-modal interaction. The model is pretrained with masked token reconstruction and auxiliary perception tasks, achieving a driving score of $DS = 49.05$ and a route completion of $RC = 92.85\%$ on CARLA LongSet6 and remaining robust when sensor inputs are partially damaged. Overall, MaskFuser improves joint representation, perception detail, and driving stability, suggesting a practical and scalable direction for robust multi-modality fusion in autonomous driving.

Abstract

Current multi-modality driving frameworks normally fuse representations by applying attention between single-modality branches. However, existing networks still limit driving performance because the image and LiDAR branches are independent and lack a unified observation representation. This paper therefore proposes MaskFuser, which tokenizes the modalities into a unified semantic feature space and provides a joint representation for subsequent behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training. The masked training enhances the fused representation by reconstructing masked tokens. Architecturally, a hybrid-fusion network is proposed to combine the advantages of early and late fusion: in the early fusion stage, modalities are fused by performing monotonic-to-BEV translation attention between branches; late fusion is performed by tokenizing the modalities into a unified token space and applying a shared encoder to it. MaskFuser reaches a driving score of 49.05 and a route completion of 92.85% on the CARLA LongSet6 benchmark, improving on the best previous baselines by 1.74 and 3.21%, respectively. The introduced masked fusion increases driving stability under damaged sensory inputs: MaskFuser outperforms the best previous baselines on driving score by 6.55 (27.8%), 1.53 (13.8%), and 1.57 (30.9%) at sensory masking ratios of 25%, 50%, and 75%, respectively.
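To make the masked-token idea concrete, the following is a minimal sketch of random masking over a joint camera and LiDAR token sequence, in the spirit of masked auto-encoder pretraining. It is not the authors' implementation: the function name `mask_joint_tokens`, the list-based token representation, the simple concatenation order, and the `seed` parameter are all assumptions made for illustration.

```python
import random


def mask_joint_tokens(camera_tokens, lidar_tokens, mask_ratio, seed=0):
    """Randomly hide a fraction of a joint camera + LiDAR token sequence.

    Hypothetical helper for illustration only. Tokens are plain Python
    objects here (e.g. lists of floats); a real model would use tensors.
    Returns (visible_tokens, keep_idx, mask), where mask[i] is True when
    joint token i is hidden from the encoder.
    """
    # Assumption: the joint sequence is camera tokens followed by LiDAR tokens.
    joint = list(camera_tokens) + list(lidar_tokens)
    n_keep = round(len(joint) * (1.0 - mask_ratio))

    # Sample which positions stay visible, then restore sequence order.
    rng = random.Random(seed)
    keep_idx = sorted(rng.sample(range(len(joint)), n_keep))

    kept = set(keep_idx)
    mask = [i not in kept for i in range(len(joint))]
    visible = [joint[i] for i in keep_idx]
    return visible, keep_idx, mask


# Toy example: 12 camera tokens and 8 LiDAR tokens, each a 4-dim vector.
camera = [[float(i)] * 4 for i in range(12)]
lidar = [[float(100 + i)] * 4 for i in range(8)]
visible, keep_idx, mask = mask_joint_tokens(camera, lidar, mask_ratio=0.75)
print(len(visible), sum(mask))
```

In a full pipeline only the visible tokens would be passed to the shared encoder, and a decoder would be trained to reconstruct the content at the masked positions. Sampling positions over the joint sequence, rather than per modality, means a draw can hide most of one sensor's tokens, which is one plausible way to encourage reliance on the other modality.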

Paper Structure

This paper contains 22 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overall network structure of MaskFuser with the hybrid fusion structure used for pretraining. The network applies Monotonic-to-BEV Translation (MBT) attention for early fusion. Features from the different modalities are patched into tokens with a unified position encoding and masked randomly. A shared transformer encoder is applied to obtain the perception state of the current environment. For masked pretraining, an additional transformer decoder is applied to reconstruct the original camera and LiDAR sensory inputs and to predict the auxiliary tasks.
  • Figure 2: The structure of the MBT attention module, where features from the monotonic view are projected into BEV space through a sequence-to-sequence formulation.
  • Figure 3: The structure of the waypoint prediction network, where the dotted line denotes the auxiliary loss.
  • Figure 4: Visualization of the driving process. The two rows show common driving conditions during the day and at night, such as pedestrians, heavy traffic, low light, and rainy weather. For the segmentation map, the legend entries are none, vehicle, road, red light, red light, road line, pedestrian, and sidewalk (white).
  • Figure 5: Averaged attention visualization on joint tokens. The attention maps are re-projected into the range view and the BEV view. The BEV map prediction, depth prediction, and segmentation prediction are reported alongside the attention maps.
  • ...and 5 more figures