Model-Based Imitation Learning for Urban Driving

Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, Jamie Shotton

TL;DR

MILE is a camera-only, model-based imitation learning method for urban driving that jointly learns a world model and a driving policy from offline expert demonstrations, with no online interaction with the environment. It lifts image features to 3D and pools them into a bird's-eye view (BeV) representation, and a latent dynamics model generates diverse, long-horizon predictions, including plans imagined entirely in latent space. On the CARLA simulator, MILE improves on the prior state of the art by 31% in driving score when deployed in a new town and new weather conditions, and it can drive closed-loop from imagined states and actions when a proportion of the camera frames is not observed. The work also examines how a low-dimensional latent state and probabilistic inference contribute to planning and control. Because training is fully offline, the approach can scale with expert demonstrations, and it suggests reward learning and self-supervision as avenues for future work.

Abstract

An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon prior state-of-the-art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions that can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile.

Paper Structure (46 sections, 12 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Architecture of MILE. The goal is to infer the latent dynamics $({\mathbf{h}}_{1:T}, {\mathbf{s}}_{1:T})$ that generated the observations ${\mathbf{o}}_{1:T}$, the expert actions ${\mathbf{a}}_{1:T}$ and the bird's-eye view labels ${\mathbf{y}}_{1:T}$. The latent dynamics contains a deterministic history ${\mathbf{h}}_t$ and a stochastic state ${\mathbf{s}}_t$. The inference model, with parameters $\phi$, estimates the posterior distribution of the stochastic state $q({\mathbf{s}}_t | {\mathbf{o}}_{\le t}, {\mathbf{a}}_{<t}) \sim \mathcal{N}(\mu_{\phi}({\mathbf{h}}_{t}, {\mathbf{a}}_{t-1}, {\mathbf{x}}_t), \sigma_{\phi}({\mathbf{h}}_{t}, {\mathbf{a}}_{t-1}, {\mathbf{x}}_t)\bm{I})$ with ${\mathbf{x}}_t = e_{\phi}({\mathbf{o}}_t)$. $e_{\phi}$ is the observation encoder that lifts image features to 3D, pools them to bird's-eye view, and compresses to a 1D vector. The generative model, with parameters $\theta$, estimates the prior distribution of the stochastic state $p({\mathbf{s}}_t | {\mathbf{h}}_{t-1}, {\mathbf{s}}_{t-1}) \sim \mathcal{N}(\mu_{\theta}({\mathbf{h}}_{t}, \hat{{\mathbf{a}}}_{t-1}), \sigma_{\theta}({\mathbf{h}}_t, \hat{{\mathbf{a}}}_{t-1})\bm{I})$, with ${\mathbf{h}}_{t} = f_{\theta}({\mathbf{h}}_{t-1}, {\mathbf{s}}_{t-1})$ the deterministic transition, and $\hat{{\mathbf{a}}}_{t-1} = \pi_{\theta}({\mathbf{h}}_{t-1}, {\mathbf{s}}_{t-1})$ the predicted action. It additionally estimates the distributions of the observation $p({\mathbf{o}}_t | {\mathbf{h}}_t, {\mathbf{s}}_t) \sim \mathcal{N}(g_{\theta}({\mathbf{h}}_t, {\mathbf{s}}_t), \bm{I})$, the bird's-eye view segmentation $p({\mathbf{y}}_t | {\mathbf{h}}_t, {\mathbf{s}}_t) \sim \mathrm{Categorical}(l_{\theta}({\mathbf{h}}_t, {\mathbf{s}}_t))$, and the action $p({\mathbf{a}}_t | {\mathbf{h}}_t, {\mathbf{s}}_t) \sim \mathrm{Laplace}(\pi_{\theta}({\mathbf{h}}_t, {\mathbf{s}}_t), \mathbf{1})$. In the diagram, we represented our model observing inputs for $T=2$ timesteps, and then imagining future latent states and actions for one step.
  • Figure 2: Qualitative example of multi-modal predictions, for 8 seconds in the future. BeV segmentation legend: black = ego-vehicle, white = background, gray = road, dark gray = lane marking, blue = vehicles, cyan = pedestrians, green/yellow/red = traffic lights. Ground truth labels (GT) outside the field-of-view of the front camera are masked out. In this example, we visualise two distinct futures predicted by the model: 1) (top row) driving through the green light, 2) (bottom row) stopping because the model imagines the traffic light turning red. Note the light transition from green, to yellow, to red, and also at the last frame $t+8.0\mathrm{s}$ how the traffic light in the left lane turns green.
  • Figure 3: Analysis on the latent state dimension. We report closed-loop driving performance in a new town and new weather in CARLA.
  • Figure 4: Driving in imagination. We report the closed-loop driving performance and perception accuracy in CARLA when the model imagines future states and actions and does not observe a proportion of the images.
  • Figure 5: An example of the model imagining and accurately predicting future states and actions to negotiate a roundabout. When imagining, the model does not observe the image frames, but predicts the future states and actions from its current latent state.
  • ...and 3 more figures
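
The observe-then-imagine rollout described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the learned networks $f_{\theta}$, $\pi_{\theta}$, $\mu/\sigma_{\theta}$ and $\mu/\sigma_{\phi}$ are replaced by random toy linear layers, the embeddings ${\mathbf{x}}_t$ are supplied directly rather than produced by the encoder $e_{\phi}$, and the class name `ToyLatentDynamics`, the function `rollout`, the dimensions, and the zero initial action are all assumptions made for illustration. It only shows the order of operations: while observing, the stochastic state is sampled from the posterior using the previous expert action; while imagining, it is sampled from the prior using the policy's own predicted action.

```python
import math
import random

# Toy sizes for h, s, x, a (hypothetical; the real model uses learned networks).
H, S, X, A = 4, 2, 3, 2


def make_layer(rng, n_in, n_out):
    """Random weight matrix standing in for a learned network."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(W, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class ToyLatentDynamics:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.W_f = make_layer(self.rng, H + S, H)            # stands in for f_theta
        self.W_pi = make_layer(self.rng, H + S, A)           # stands in for pi_theta
        self.W_prior = make_layer(self.rng, H + A, 2 * S)    # mu_theta, sigma_theta
        self.W_post = make_layer(self.rng, H + A + X, 2 * S) # mu_phi, sigma_phi

    def transition(self, h_prev, s_prev):
        # h_t = f_theta(h_{t-1}, s_{t-1}): deterministic history update
        return [math.tanh(v) for v in dense(self.W_f, h_prev + s_prev)]

    def policy(self, h, s):
        # a_hat = pi_theta(h, s)
        return [math.tanh(v) for v in dense(self.W_pi, h + s)]

    def _gaussian(self, W, inputs):
        out = dense(W, inputs)
        mu, raw = out[:S], out[S:]
        sigma = [math.log1p(math.exp(r)) + 1e-3 for r in raw]  # softplus keeps sigma > 0
        return mu, sigma

    def prior(self, h, a_hat_prev):
        return self._gaussian(self.W_prior, h + a_hat_prev)

    def posterior(self, h, a_prev, x):
        return self._gaussian(self.W_post, h + a_prev + x)

    def sample(self, mu, sigma):
        return [m + s * self.rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def rollout(model, embeddings, expert_actions, n_imagine):
    """Observe len(embeddings) steps, then imagine n_imagine further steps."""
    h, s = [0.0] * H, [0.0] * S
    a_prev = [0.0] * A  # assumption: zero action before the first step
    steps = []
    for x, a in zip(embeddings, expert_actions):
        h = model.transition(h, s)
        s = model.sample(*model.posterior(h, a_prev, x))  # uses observation x_t
        steps.append(("observed", h, s, model.policy(h, s)))
        a_prev = a  # the expert action conditions the next posterior
    for _ in range(n_imagine):
        a_hat = model.policy(h, s)  # predicted action from the previous state
        h = model.transition(h, s)
        s = model.sample(*model.prior(h, a_hat))  # no observation available
        steps.append(("imagined", h, s, model.policy(h, s)))
    return steps


model = ToyLatentDynamics(seed=0)
xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]   # made-up embeddings x_1, x_2
acts = [[0.5, 0.0], [0.4, 0.1]]            # made-up expert actions a_1, a_2
trajectory = rollout(model, xs, acts, n_imagine=1)
for kind, h, s, a in trajectory:
    print(kind, [round(v, 3) for v in a])
```

With two observed steps and `n_imagine=1`, this mirrors the $T=2$ setting drawn in the diagram; increasing `n_imagine` corresponds to the longer imagined horizons reported in Figures 4 and 5.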
