Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

Zeyu Gao, Yao Mu, Chen Chen, Jingliang Duan, Shengbo Eben Li, Ping Luo, Yanfeng Lu

TL;DR

This work tackles the limitations of end-to-end urban autonomous driving, where high-dimensional inputs and unbalanced training data hamper sample efficiency and robustness. It introduces SEM2, a SEMantic Masked recurrent world model that learns driving-relevant latent dynamics via a semantic filter and employs a multi-source sampler to balance common and corner-case data, enabling efficient policy learning through latent imagination. Key contributions include the semantic filter with a decoupled mask mechanism, a variational objective incorporating observation, mask, and reward reconstruction, and an adaptive multi-source data strategy that prevents mode collapse in corner cases. Empirical results on CARLA demonstrate superior sample efficiency and robustness to input perturbations compared with baselines such as SAC and DreamerV2.

Abstract

End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model that maps the high-dimensional observations into a compact latent space. However, the latent states embedded by the world models proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sample efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy struggles to cope with corner cases during driving. To address these challenges, we present a SEMantic Masked recurrent world model (SEM2), which introduces a semantic filter to extract key driving-relevant features and makes decisions via the filtered features. It is trained with a multi-source data sampler, which aggregates common data and multiple kinds of corner-case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms state-of-the-art approaches in terms of sample efficiency and robustness to input perturbations.
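The multi-source data sampler described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MultiSourceSampler`, the batch shares, the buffer capacity, and sampling with replacement are all assumptions made for illustration. The buffer names follow the Figure 4 caption (one common buffer plus outlane and collision buffers).

```python
import random
from collections import deque


class MultiSourceSampler:
    """Draws one mixed batch from a common buffer and corner-case buffers."""

    def __init__(self, ratios, capacity=10_000, seed=0):
        # ratios: hypothetical mapping of buffer name -> share of each batch.
        total = sum(ratios.values())
        self.ratios = {name: share / total for name, share in ratios.items()}
        self.buffers = {name: deque(maxlen=capacity) for name in ratios}
        self.rng = random.Random(seed)

    def add(self, source, transition):
        # Store a transition in the buffer that matches its source.
        self.buffers[source].append(transition)

    def sample(self, batch_size):
        # Only non-empty buffers contribute; their shares are renormalised.
        active = {n: r for n, r in self.ratios.items() if self.buffers[n]}
        if not active:
            raise ValueError("all replay buffers are empty")
        total = sum(active.values())
        names = list(active)
        batch = []
        for i, name in enumerate(names):
            if i == len(names) - 1:
                count = batch_size - len(batch)  # remainder goes to the last source
            else:
                count = int(batch_size * active[name] / total)
            # Assumption: sample with replacement so that small corner-case
            # buffers can still fill their share of the batch.
            batch.extend(self.rng.choices(list(self.buffers[name]), k=count))
        self.rng.shuffle(batch)
        return batch


# Illustrative usage with placeholder transitions (tuples of source and step).
sampler = MultiSourceSampler({"common": 0.5, "outlane": 0.25, "collision": 0.25})
for step in range(100):
    sampler.add("common", ("common", step))
for step in range(5):
    sampler.add("outlane", ("outlane", step))
    sampler.add("collision", ("collision", step))
batch = sampler.sample(32)
```

With these assumed shares, every batch of 32 contains 16 common, 8 outlane, and 8 collision transitions, even though corner cases make up under 10% of the stored data. The Figure 4 caption notes that the data distribution is adjusted after evaluation; in this sketch that would correspond to changing `ratios`, but the adjustment rule itself is not given in this section.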

Paper Structure (28 sections, 7 equations, 13 figures, 4 tables, 2 algorithms)

Figures (13)

  • Figure 1: Structure of the recurrent state space model (RSSM). The latent state $s_{i}$ in RSSM is composed of a deterministic variable $h_{i}$ and a stochastic variable $z_{i}$. The generative process is represented by solid lines, while the inference model is represented by dashed lines. The latent dynamics contain a deterministic path and a stochastic path, which together capture historical information efficiently while accurately modeling the stochastic nature of the dynamics.
  • Figure 2: The overall structure of SEM2. SEM2 takes the observation $o_{t}$ from the camera and lidar as input and encodes it into a latent state comprising a deterministic variable $h_{t}$ and a stochastic variable $z_{t}$. The original features are used to reconstruct the observation. The latent semantic filter extracts the driving-relevant features from the original features; these filtered features are used to reconstruct the semantic mask $\hat{m}_{t}$ and predict the reward $\hat{r}_{t}$.
  • Figure 3: Working process of the semantic mask filter.
  • Figure 4: The structure of the multi-source sampler used to train SEM2. In addition to the common replay buffer, there are two corner-case replay buffers that store the data from outlane cases and collision cases independently. In every training iteration, we sample a mini-batch from each of the three replay buffers in turn, providing diverse data for the SEM2 update. The data distribution is adjusted after each evaluation of SEM2.
  • Figure 5: The policy is learned from sequences imagined in the compact latent space of the semantic masked world model. These sequences start from posterior states computed during model training and advance by sampling actions from the actor, which receives the filtered driving-relevant features. The critic network estimates the expected rewards for each state by applying temporal difference learning to the imagined rewards. The actor is trained to maximize the expected rewards via straight-through gradients through the learned world model.
  • ...and 8 more figures
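
The temporal difference learning mentioned in the Figure 5 caption can be illustrated with a small sketch. It assumes the simplest variant, a one-step TD target $r_t + \gamma V(s_{t+1})$ with a mean squared error loss; the paper may use a different target (for example a multi-step return). The function names `td_targets` and `critic_loss`, the discount value, and the numbers are hypothetical.

```python
def td_targets(rewards, values, gamma=0.99):
    """One-step TD targets r_t + gamma * V(s_{t+1}) along an imagined rollout.

    rewards[t] is the predicted reward at step t. values has one extra
    entry so that values[t + 1] can bootstrap the final step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    return [r + gamma * v_next for r, v_next in zip(rewards, values[1:])]


def critic_loss(values, targets):
    """Mean squared TD error; the targets are treated as constants."""
    errors = [(v - y) ** 2 for v, y in zip(values, targets)]
    return sum(errors) / len(errors)


# Illustrative imagined rollout of three steps (made-up numbers).
imagined_rewards = [1.0, 0.5, -2.0]
critic_values = [2.0, 1.0, 0.0, 0.0]  # V(s_0) .. V(s_3)
targets = td_targets(imagined_rewards, critic_values, gamma=0.5)
loss = critic_loss(critic_values[:-1], targets)
```

In a full agent the rewards and values would come from the world model's reward head and the critic network evaluated on imagined latent states, and the loss would be minimized by gradient descent on the critic parameters only.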