End-to-End Driving via Self-Supervised Imitation Learning Using Camera and LiDAR Data

Jin Bok Park, Jinkyu Lee, Muhyun Back, Hyunmin Han, David T. Ma, Sang Min Won, Sung Soo Hwang, Il Yong Chun

TL;DR

The paper tackles the challenge of learning end-to-end autonomous driving models without labeled driving commands by introducing self-supervised imitation learning (SSIL) that uses camera data and LiDAR-derived vehicle poses. A pseudo-label predictor g is constructed as a composition of h1 (LiDAR-based pose estimation) and h2 (steering angle estimation via Ackermann geometry), enabling supervision-free training of f, the driving network, conditioned on high-level instructions. Experiments on A2D2, nuScenes, and CARLA show SSIL achieves driving performance comparable to supervised imitation learning (SIL) while outperforming PID-based pseudo-labels, and extending SSIL to self-supervised feature learning further improves representations. The approach offers a practical path to scalable E2E driving by leveraging domain knowledge and multi-sensor data, with limitations tied to SLAM reliability and vehicle parameter generalization. Overall, SSIL demonstrates that carefully designed pseudo-labels from pose and geometry can substitute for direct driving commands in end-to-end driving systems, enabling broader data applicability and potential improvements in real-world deployment.

Abstract

In autonomous driving, the end-to-end (E2E) driving approach, which predicts vehicle control signals directly from sensor data, is rapidly gaining attention. Learning a safe E2E driving system requires an extensive amount of driving data and human intervention. Vehicle control data is built from many hours of human driving, so constructing large vehicle control datasets is challenging. Publicly available driving datasets often cover limited driving scenes, and vehicle control data can be collected only by vehicle manufacturers. To address these challenges, this letter proposes the first fully self-supervised learning framework for E2E driving, self-supervised imitation learning (SSIL), based on the self-supervised regression learning (SSRL) framework. The proposed SSIL framework can learn E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, the proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points, which are estimated with light detection and ranging (LiDAR) sensors. In addition, we propose two E2E driving networks that predict driving commands depending on a high-level instruction. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves E2E driving accuracy very comparable to that of its supervised learning counterpart. The proposed pseudo-label predictor outperformed an existing one that uses a proportional-integral-derivative (PID) controller.

Paper Structure (27 sections, 1 theorem, 14 equations, 6 figures, 5 tables)

This paper contains 27 sections, 1 theorem, 14 equations, 6 figures, 5 tables.

Key Result

Theorem 1

Suppose that $f$ and $g$ are measurable. Then the expected prediction error of the optimal solution $f^\star$ of SSRL (Eq. (sys:ssrl)), $f^\star ({\mathbf{x}}) = \mathbb{E}[ g({\mathbf{x}}_J) | {\mathbf{x}}_{J^c} ]$ for each $J \in \mathcal{J}$, at an unseen input ${\mathbf{x}}'$ is given by where $f^\ast$ is the optimal solution of the supervision counterpart, $f^\ast ({\mathbf{x}}) = \mathbb{E}

Figures (6)

  • Figure 1: Overview of the proposed SSIL framework. We modify the general SSRL framework Chunetal:22arxiv by designing a pseudo-label predictor using domain knowledge. We designed it as the composition of a vehicle pose estimator, which uses point clouds ${\mathbf{x}}_{J}$ generated from LiDAR sensor(s), and a pseudo steering angle predictor. In comparing SSIL with the ordinary supervised learning method, we tested two E2E driving network architectures, one without the transformer attention mechanism and the other with it. To train an E2E driving network that takes a camera image ${\mathbf{x}}_{J^c}$ as input and predicts a steering angle, a loss function measures the discrepancy between the pseudo target $\hat{y}$ and the predicted steering angle.
  • Figure 2: Geometric illustration of calculating the turning radius in the $X$-$Y$ vehicle coordinate system where ${\mathbf{P}}_{i-1,i-1} = {\mathbf{I}}$. The matrices ${\mathbf{P}}_{i,i-1}$ and ${\mathbf{P}}_{i-1,i-1}$ indicate the vehicle poses at the current and previous time points, respectively. The vector ${\mathbf{d}}_i$ indicates the forward direction vector of the vehicle at the current time point. The distance $r_i$ is the turning radius of the vehicle at the $i$th time point. It is given by the $X$-intercept of the line that is perpendicular to ${\mathbf{d}}_i$ and crosses ${\mathbf{t}}_{i,i-1}$ (the translation vector of ${\mathbf{P}}_{i,i-1}$) in the $X$-$Y$ vehicle coordinate system. The blue dotted arrow indicates the driving trajectory from the $(i-1)$th to the $i$th time point.
  • Figure 3: The Ackermann steering geometry Vehicle_dynamics for four-wheeled vehicles using front-wheel drive. We set $l_{\text{wb}}$ as the actual wheelbase of the vehicle, and calculate $r_i$ as in Eq. (sys:radius). In calculating $\delta_{i}$, we consider $l_{\text{wb}}$ and $r_i$ as the length of the opposite side and the length of the hypotenuse, respectively.
  • Figure 4: Steering wheel and front wheels. There exists a specific relation between steering wheel and front wheel angles.
  • Figure 5: The overall architectures of modified PilotNet and modified Latent TransFuser. (a) The modified PilotNet uses a feature extractor for an RGB input at the current time point $i$, and a conditional module that selects an FCN depending on a high-level instruction. (b) The modified Latent TransFuser uses 1) two feature extractors, one for an RGB image at the current time point $i$ and the other for two-channel positional encoding, 2) four transformers in the fusion module that incorporates the feature maps extracted from an RGB image and two-channel positional encoding, and 3) a conditional module that predicts a steering angle depending on a high-level instruction.
  • ...and 1 more figure
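
The geometry in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the handling of straight driving (infinite radius mapped to a zero angle), the signed-radius convention, and the example wheelbase of 2.8 m are all hypothetical. The sketch also stops at the front-wheel angle and does not model the steering wheel to front wheel relation mentioned in Figure 4.

```python
import math


def turning_radius(t_xy, d_xy, eps=1e-9):
    """X-intercept of the line through t that is perpendicular to d.

    t_xy: (t_x, t_y), translation of the current pose in the previous
          vehicle frame (where the previous pose is the identity).
    d_xy: (d_x, d_y), forward direction of the vehicle at the current time.
    The line is {p : d . (p - t) = 0}; setting p = (x, 0) gives
    x = t_x + d_y * t_y / d_x.
    Assumption: the sign of the result encodes the turn direction.
    """
    t_x, t_y = t_xy
    d_x, d_y = d_xy
    if abs(d_x) < eps:
        # The perpendicular line is parallel to the X axis, so there is no
        # intercept. Treated here as straight driving (an assumption).
        return math.inf
    return t_x + d_y * t_y / d_x


def pseudo_steering_angle(radius, wheelbase):
    """Angle with the wheelbase as the opposite side and the turning radius
    as the hypotenuse, i.e. sin(delta) = wheelbase / radius."""
    if math.isinf(radius):
        return 0.0
    # Clamp so that noisy poses cannot push asin outside its domain.
    ratio = max(-1.0, min(1.0, wheelbase / radius))
    return math.asin(ratio)


if __name__ == "__main__":
    # Synthetic check: a vehicle that starts at the origin heading along +Y
    # (an assumed axis convention) and follows a circle of radius 10 m
    # centred at (10, 0) through an angle of 0.1 rad.
    theta, r_true = 0.1, 10.0
    t = (r_true - r_true * math.cos(theta), r_true * math.sin(theta))
    d = (math.sin(theta), math.cos(theta))
    r = turning_radius(t, d)
    print(r, pseudo_steering_angle(r, wheelbase=2.8))
```

On the synthetic arc, the recovered radius matches the radius of the circle used to generate the pose, which is a quick sanity check that the X-intercept construction behaves as the caption describes. With real LiDAR-estimated poses, the near-zero `d_x` case and the clamping step matter more, since small pose errors dominate when the vehicle drives almost straight.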

Theorems & Definitions (1)

  • Theorem 1: Expected prediction error Chunetal:22arxiv