Efficient Fusion and Task Guided Embedding for End-to-end Autonomous Driving

Yipin Guo, Yilin Lang, Qinyuan Ren

TL;DR

EfficientFuser tackles the resource bottlenecks of end-to-end autonomous driving by fusing multi-view visual features with cross-attention and predicting via a decoder-only transformer guided by task-embedded tokens. The approach uses EfficientViT as a lightweight backbone and a dynamic control mechanism that balances waypoint tracking and control actions through learned loss signals. Empirical results on CARLA Town05 show substantial reductions in parameters and computation (37.6% and 8.7% of the state-of-the-art lightweight baseline, respectively) with only a 0.4% drop in driving score and competitive safety performance. Ablations highlight the value of cross-attention, learnable prediction tokens, and dynamic weighting. The work presents a practical, scalable path toward real-time, end-to-end autonomous driving on resource-constrained hardware, while noting sim-to-real transfer challenges.

Abstract

To handle sensor fusion and safety risk prediction, contemporary closed-loop autonomous driving networks trained with imitation learning typically require large parameter counts and substantial computational resources. Given the constrained computational capacity of onboard vehicular computers, we introduce a compact yet potent solution named EfficientFuser. This approach employs EfficientViT for visual information extraction and integrates feature maps via cross attention. It then uses a decoder-only transformer to combine the multiple features. For prediction, learnable vectors are embedded as tokens to probe the association between the task and sensor features through attention. Evaluated on the CARLA simulation platform, EfficientFuser uses only 37.6% of the parameters and 8.7% of the computation of the state-of-the-art lightweight method while scoring only 0.4% lower on driving score, and its safety score approaches that of the leading safety-enhanced method, indicating its efficacy and potential for practical deployment in autonomous driving systems.
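The cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections, and adds a plain residual connection. The names `cross_attention`, `main_view`, and `side_view` and the toy token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross attention without learned projections (assumption).

    queries:     tokens of the view being updated, shape [Nq][d]
    keys_values: tokens of the other view, shape [Nk][d]
    Returns each query plus an attention-weighted mix of the other view.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every token of the other view.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the other view's tokens.
        mixed = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        # Residual connection keeps the original view's information.
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


# Toy feature tokens (hypothetical values): 2 main-view tokens, 3 side-view tokens.
main_view = [[1.0, 0.0], [0.0, 1.0]]
side_view = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused_main = cross_attention(main_view, side_view)
```

The output has the same shape as the query view, so a fused feature map can stand in for the original one. In this sketch the two views may have different token counts, which is compatible with using backbones of different sizes per view.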

Paper Structure (15 sections, 4 equations, 6 figures, 4 tables)

This paper contains 15 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Architecture. Image feature extraction: EfficientFuser takes multi-view RGB images as input and fuses feature maps between different views with several cross-attention modules. EfficientViT serves as the visual backbone, which suits cross attention while keeping the computational footprint low. Because the main-view and focus-view images are more important, they use larger backbones, while side-view images use smaller ones. Decoder: A decoder-only transformer makes the predictions. Visual features and sensor features are input as tokens. Additionally, two trainable vectors are set up as tokens, learning relevant information from the other tokens at an early stage of the decoder.
  • Figure 2: Fusion with Cross Attention.
  • Figure 3: Decoder-only transformer with embedded task-guided learnable vectors.
  • Figure 4: The penalties incurred for infractions.
  • Figure 5: Layer0 Head3. Prediction tokens focus on the measurement tokens.
  • ...and 1 more figures
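
The decoder idea described in the Figure 1 and Figure 3 captions (feature tokens plus two trainable prediction tokens) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it uses one attention layer with no learned projections, places the prediction tokens last, and applies a causal mask so that they can attend to every feature token. The token counts, the dimension `d`, and all names are hypothetical, and the random values stand in for parameters that would be learned in training.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(tokens):
    """One self-attention layer where token i sees tokens 0..i (assumed masking)."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, attention = [], []
    for i, q in enumerate(tokens):
        visible = tokens[: i + 1]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in visible]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        )
        attention.append(weights)
    return outputs, attention


random.seed(0)
d = 4  # hypothetical token dimension


def random_token(std):
    return [random.gauss(0.0, std) for _ in range(d)]


visual_tokens = [random_token(1.0) for _ in range(3)]       # stand-in for fused image features
measurement_tokens = [random_token(1.0) for _ in range(1)]  # stand-in for sensor features
prediction_tokens = [random_token(0.02) for _ in range(2)]  # would be trainable vectors

sequence = visual_tokens + measurement_tokens + prediction_tokens
outputs, attention = causal_self_attention(sequence)

# The outputs at the two prediction-token positions would feed the prediction heads.
pred_features = outputs[-2:]
# Attention weights of the last prediction token over all six tokens.
last_pred_attention = attention[-1]
```

Inspecting `attention[-2]` and `attention[-1]` is the kind of analysis that Figure 5 reports, where the prediction tokens are shown focusing on the measurement tokens.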