End-to-end Driving via Conditional Imitation Learning

Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, Alexey Dosovitskiy

TL;DR

Driving policies trained end-to-end by imitation learning are ambiguous at intersections: the camera image alone does not determine which turn to take, so the policy cannot be directed at test time. The authors propose command-conditioned imitation learning, in which a high-level command c is supplied alongside the observation and guides the perception-to-action mapping, so that a navigator or passenger can direct the vehicle at test time. They present two architectures (command-input and branched) and show that the branched model performs best both in CARLA simulations and on a 1/5-scale truck; ablations confirm the importance of noise-injected training data and online augmentation. The results indicate that command-conditioned end-to-end driving can be made controllable and robust, which the authors take as support for vision-based autonomous driving that scales.
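
As a rough illustration of the difference between the two architectures, the Python sketch below contrasts them on toy data. It is not the authors' implementation. The command names in COMMANDS, the plain-list feature vector, and the callable "heads" are all assumptions made for the example; in the paper the features come from a network that processes the camera image and measurements.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical command vocabulary (an assumption for this sketch).
COMMANDS = ("follow", "left", "right", "straight")

Features = Sequence[float]          # stand-in for learned image features
Action = List[float]                # e.g. [steering]; layout is an assumption
Head = Callable[[Features], Action]


def one_hot(command: str) -> List[float]:
    """Encode a command as a one-hot vector over COMMANDS."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return [1.0 if c == command else 0.0 for c in COMMANDS]


def command_input_policy(features: Features, command: str, head: Head) -> Action:
    """Command-input variant: the encoded command is concatenated to the
    features, and a single head maps the joint vector to an action."""
    return head(list(features) + one_hot(command))


def branched_policy(features: Features, command: str,
                    branches: Dict[str, Head]) -> Action:
    """Branched variant: the command acts as a switch that selects one
    specialized head; the other heads are not evaluated."""
    if command not in branches:
        raise ValueError(f"no branch for command: {command!r}")
    return branches[command](features)
```

In the branched variant the command never enters a head as an input, so a head cannot ignore it: choosing a different command runs a different sub-module.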

Abstract

Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate different architectures for conditional imitation learning in vision-based driving. We conduct experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area. Both systems drive based on visual input yet remain responsive to high-level navigational commands. The supplementary video can be viewed at https://youtu.be/cFtnflNe5fM

Paper Structure

This paper contains 23 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Conditional imitation learning allows an autonomous vehicle trained end-to-end to be directed by high-level commands. (a) We train and evaluate robotic vehicles in the physical world (top) and in simulated urban environments (bottom). (b) The vehicles drive based on video from a forward-facing onboard camera. At the time these images were taken, the vehicle was given the command "turn right at the next intersection". (c) The trained controller handles sensorimotor coordination (staying on the road, avoiding collisions) and follows the provided commands.
  • Figure 2: High-level overview. The controller receives an observation $\mathbf{o}_t$ from the environment and a command $\mathbf{c}_t$. It produces an action $\mathbf{a}_t$ that affects the environment, advancing to the next time step.
  • Figure 3: Two network architectures for command-conditional imitation learning. (a) command input: the command is processed as input by the network, together with the image and the measurements. The same architecture can be used for goal-conditional learning (one of the baselines in our experiments), by replacing the command with a vector pointing to the goal. (b) branched: the command acts as a switch that selects between specialized sub-modules.
  • Figure 4: Noise injection during data collection. We show a fragment from an actual driving sequence from the training set. The plot on the left shows steering control [rad] versus time [s]. In the plot, the red curve is an injected triangular noise signal, the green curve is the driver's steering signal, and the blue curve is the steering signal provided to the car, which is the sum of the driver's control and the noise. Images on the right show the driver's view at three points in time (trajectories overlaid post-hoc for visualization). Between times 0 and roughly 1.0, the noise produces a drift to the right, as illustrated in image (a). This triggers a human reaction, from 1.0 to 2.5 seconds, illustrated in (b). Finally, the car recovers from the disturbance, as shown in (c). Only the driver-provided signal (green curve on the left) is used for training.
  • Figure 5: Simulated urban environments. Town 1 is used for training (left), Town 2 is used exclusively for testing (right). Map on top, view from onboard camera below. Note the difference in visual style.
  • ...and 4 more figures
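
The noise-injection scheme described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' data-collection code: the function names and the parameters `start`, `duration`, and `peak` are hypothetical, and the caption specifies only that the noise is triangular, that the car receives the sum of the driver's control and the noise, and that only the driver's signal is used for training.

```python
from typing import Tuple


def triangular_noise(t: float, start: float, duration: float, peak: float) -> float:
    """Triangular impulse: zero outside [start, start + duration], rising
    linearly to `peak` at the midpoint and falling linearly back to zero."""
    if duration <= 0.0:
        raise ValueError("duration must be positive")
    x = (t - start) / duration          # normalized position within the impulse
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return peak * (1.0 - abs(2.0 * x - 1.0))


def record_step(driver_steer: float, t: float, start: float,
                duration: float, peak: float) -> Tuple[float, float]:
    """Return (applied, label) for one time step.

    applied: steering sent to the car (driver's control plus injected noise).
    label:   steering stored as the training target (driver's control only).
    """
    noise = triangular_noise(t, start, duration, peak)
    applied = driver_steer + noise
    label = driver_steer
    return applied, label
```

Because the label excludes the noise, the recorded data pairs drifted observations with the driver's corrective steering rather than with the disturbance that caused the drift.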