Hierarchical End-to-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning

Siyi Lu, Lei He, Shengbo Eben Li, Yugong Luo, Jianqiang Wang, Keqiang Li

TL;DR

This paper proposes a DRL-based end-to-end driving framework that uses multi-sensor inputs to build a unified three-dimensional understanding of the environment. Its BEV-based feature extractor translates critical environmental features into high-level abstract states for DRL, facilitating more informed control.

Abstract

End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird's-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%.

Paper Structure (16 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Our perception-driven end-to-end autonomous driving model, built on deep reinforcement learning. A feature extraction network operating in bird's-eye-view space processes the input surround-camera images and passes high-dimensional features to the reinforcement learning policy network, which directly outputs control commands for the vehicle's throttle, brake, and steering wheel.
  • Figure 2: Neural network architecture of the proposed framework. On the left is the architecture of deep reinforcement learning, and on the right is the architecture of the BEV feature extraction network.
  • Figure 3: Reward curves of the DRL and Ours-3 methods during reinforcement learning training.
  • Figure 4: Illustration of the interpretability of our approach. Each sampled frame is randomly selected from the experiment. The six images in each sampled frame are taken by one set of surround cameras. The picture on the right is the semantic segmentation result generated from these six images.
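
The data flow described in the Figure 1 and Figure 4 captions (six surround-camera images, a BEV feature extractor, a policy network, and throttle/brake/steering outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: `bev_features`, `policy`, `drive_step`, and `Control` are hypothetical names, the mean-pooling "encoder" and single linear layer are placeholders for the real networks, and the action ranges and the rule that maps a signed acceleration to throttle or brake are assumptions not stated in the source.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence

NUM_CAMERAS = 6  # Figure 4 describes six surround-camera images per frame


@dataclass
class Control:
    throttle: float  # assumed range [0, 1]
    brake: float     # assumed range [0, 1]
    steer: float     # assumed range [-1, 1]


def bev_features(images: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for the BEV feature extraction network.

    The real network lifts camera images into a shared bird's-eye-view space;
    here each image is simply reduced to its mean intensity (illustration only).
    """
    if len(images) != NUM_CAMERAS:
        raise ValueError(f"expected {NUM_CAMERAS} images, got {len(images)}")
    return [sum(img) / len(img) for img in images]


def policy(state: Sequence[float], weights: Sequence[Sequence[float]]) -> Control:
    """Stand-in for the DRL policy network: one linear layer followed by tanh."""
    accel, steer = (
        math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights
    )
    # Assumed convention: positive acceleration maps to throttle, negative to brake.
    return Control(throttle=max(accel, 0.0), brake=max(-accel, 0.0), steer=steer)


def drive_step(images, weights) -> Control:
    """One control step: images -> BEV state -> low-level control command."""
    return policy(bev_features(images), weights)


# Toy inputs: six flattened 16-pixel "images" and random policy weights.
rng = random.Random(0)
frame = [[rng.random() for _ in range(16)] for _ in range(NUM_CAMERAS)]
weights = [[rng.uniform(-1, 1) for _ in range(NUM_CAMERAS)] for _ in range(2)]
command = drive_step(frame, weights)
print(command)
```

The sketch only fixes the interfaces between stages. In the framework described above, the state passed to the policy comes from a BEV feature extraction network and the policy is trained with reinforcement learning. Here both are untrained placeholders.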