Learning a Driving Simulator

Eder Santana, George Hotz

TL;DR

The paper tackles learning a driving simulator by first embedding real road-frame videos into a Gaussian latent space using a VAE-GAN hybrid autoencoder, then learning latent-space transitions with an action-conditioned RNN. By training the autoencoder with GAN-based costs, the model generates more realistic frames than purely MSE-based approaches, while the RNN predicts future latent codes to render subsequent frames. The results show the approach can maintain road structure over many frames (up to ~100) but has difficulty with curved trajectories, highlighting the need for more advanced sequence models and sensor fusion. The authors release a driving dataset and training code to encourage further exploration of learned video prediction for driving simulation.

Abstract

Comma.ai's approach to Artificial Intelligence for self-driving cars is based on an agent that learns to clone driver behaviors and plans maneuvers by simulating future events on the road. This paper illustrates one of our research approaches for driving simulation: one where we learn to simulate. Here we investigate variational autoencoders with classical and learned cost functions using generative adversarial networks for embedding road frames. Afterwards, we learn a transition model in the embedded space using action-conditioned Recurrent Neural Networks. We show that our approach can keep predicting realistic-looking video for several frames despite the transition model being optimized without a cost function in the pixel space.
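The idea of a transition model that is trained and rolled out entirely in the embedded space can be illustrated with a small sketch. This is not the authors' released code: the layer sizes, the Elman-style update, the random weights, and the names `step`, `rollout`, and `latent_mse` are all assumptions chosen for illustration. Random vectors stand in for the codes an encoder would produce and for the control signals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; not taken from the paper.
Z_DIM, A_DIM, H_DIM = 8, 2, 16


def init_params():
    """Random weights for a toy action-conditioned RNN (assumed form)."""
    scale = 0.1
    return {
        "W_z": rng.normal(0.0, scale, (H_DIM, Z_DIM)),  # latent code -> hidden
        "W_a": rng.normal(0.0, scale, (H_DIM, A_DIM)),  # action -> hidden
        "W_h": rng.normal(0.0, scale, (H_DIM, H_DIM)),  # hidden -> hidden
        "W_o": rng.normal(0.0, scale, (Z_DIM, H_DIM)),  # hidden -> next code
    }


def step(params, h, z, a):
    """One transition: update the hidden state from (z_t, a_t), predict z_{t+1}."""
    h_next = np.tanh(params["W_z"] @ z + params["W_a"] @ a + params["W_h"] @ h)
    z_next = params["W_o"] @ h_next
    return h_next, z_next


def rollout(params, z0, actions):
    """Feed each predicted code back in as the next input, one step per action."""
    h = np.zeros(H_DIM)
    z = z0
    codes = []
    for a in actions:
        h, z = step(params, h, z, a)
        codes.append(z)
    return np.stack(codes)


def latent_mse(pred_codes, target_codes):
    """Cost measured between codes, never between decoded pixels."""
    return float(np.mean((pred_codes - target_codes) ** 2))


params = init_params()
z0 = rng.normal(size=Z_DIM)            # stand-in for an encoded road frame
actions = rng.normal(size=(5, A_DIM))  # stand-in for control signals
targets = rng.normal(size=(5, Z_DIM))  # stand-in for codes of the real next frames
pred = rollout(params, z0, actions)
loss = latent_mse(pred, targets)
```

In a full pipeline, the rows of `pred` would be passed through the trained decoder to render frames. The decoder plays no part in the transition loss, which is what "optimized without a cost function in the pixel space" refers to.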

Paper Structure

This paper contains 8 sections, 5 equations, 4 figures.

Figures (4)

  • Figure 1: $80 \times 160$ samples from the driving dataset.
  • Figure 2: Driving simulator model: an autoencoder trained with generative adversarial costs coupled with a recurrent neural network transition model.
  • Figure 3: Samples using similar fully convolutional autoencoders. Odd columns show decoded images, even columns show target images. Models were trained using (a) a generative adversarial network cost function and (b) mean square error. Both models have MSE on the order of $10^{-2}$ and PSNR on the order of $10$.
  • Figure 4: Samples generated by letting the transition model hallucinate and decoding $\tilde{z}$ using $Gen$. Note that $Gen$ was not optimized to make these samples realistic, which supports our assumption that the transition model did not leave the code space.
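
The MSE and PSNR magnitudes quoted in the Figure 3 caption are linked by the standard PSNR definition. As a rough check, assuming pixel values scaled so that the peak value is $\mathrm{MAX} = 1$ (an assumption; the scaling is not stated here):

$$
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right) = 10 \log_{10}\!\left(\frac{1}{10^{-2}}\right) = 20\ \mathrm{dB}
$$

An MSE on the order of $10^{-2}$ therefore corresponds to a PSNR in the tens of decibels, consistent with the caption. A different peak value would shift the result by a constant but not change its order of magnitude.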