Table of Contents
Fetching ...

Virtual to Real Reinforcement Learning for Autonomous Driving

Xinlei Pan, Yurong You, Ziyan Wang, Cewu Lu

TL;DR

The paper tackles the sim-to-real transfer problem in autonomous driving for reinforcement learning by introducing a two-stage realistic translation network that uses scene parsing as an intermediate representation. Virtual frames are translated to parsing maps and then to realistic frames, enabling an A3C-driving policy trained entirely in simulated visuals to operate in real-world driving. Experiments show that policies trained with translated realistic imagery generalize better to real driving than purely virtual-training or domain-randomized approaches, with transfer learning across virtual environments demonstrating improved robustness. This VR-RL framework offers a safer, cost-effective pathway for training autonomous driving policies before real-world deployment, and highlights future work to diversify appearance while preserving structure to further mitigate biases.

Abstract

Reinforcement learning is considered as a promising direction for driving policy learning. However, training autonomous driving vehicle with reinforcement learning in real environment involves non-affordable trial-and-error. It is more desirable to first train in a virtual environment and then transfer to the real environment. In this paper, we propose a novel realistic translation network to make model trained in virtual environment be workable in real world. The proposed network can convert non-realistic virtual image input into a realistic one with similar scene structure. Given realistic frames as input, driving policy trained by reinforcement learning can nicely adapt to real world driving. Experiments show that our proposed virtual to real (VR) reinforcement learning (RL) works pretty well. To our knowledge, this is the first successful case of driving policy trained by reinforcement learning that can adapt to real world driving data.

Virtual to Real Reinforcement Learning for Autonomous Driving

TL;DR

The paper tackles the sim-to-real transfer problem in autonomous driving for reinforcement learning by introducing a two-stage realistic translation network that uses scene parsing as an intermediate representation. Virtual frames are translated to parsing maps and then to realistic frames, enabling an A3C-driving policy trained entirely in simulated visuals to operate in real-world driving. Experiments show that policies trained with translated realistic imagery generalize better to real driving than purely virtual-training or domain-randomized approaches, with transfer learning across virtual environments demonstrating improved robustness. This VR-RL framework offers a safer, cost-effective pathway for training autonomous driving policies before real-world deployment, and highlights future work to diversify appearance while preserving structure to further mitigate biases.

Abstract

Reinforcement learning is considered as a promising direction for driving policy learning. However, training autonomous driving vehicle with reinforcement learning in real environment involves non-affordable trial-and-error. It is more desirable to first train in a virtual environment and then transfer to the real environment. In this paper, we propose a novel realistic translation network to make model trained in virtual environment be workable in real world. The proposed network can convert non-realistic virtual image input into a realistic one with similar scene structure. Given realistic frames as input, driving policy trained by reinforcement learning can nicely adapt to real world driving. Experiments show that our proposed virtual to real (VR) reinforcement learning (RL) works pretty well. To our knowledge, this is the first successful case of driving policy trained by reinforcement learning that can adapt to real world driving data.

Paper Structure

This paper contains 9 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Framework for virtual to real reinforcement learning for autonomous driving. Virtual images rendered by a simulator (environment) are first segmented to scene parsing representation and then translated to synthetic realistic images by the proposed image translation network (VISRI). Agent observes synthetic realistic images and takes actions. Environment will give reward to the agent. Since the agent is trained using realistic images that are visually similar to real world scenes, it can nicely adapt to real world driving.
  • Figure 2: Example image segmentation for both virtual world images (Left 1 and Left 2) and real world images (Right 1 and Right 2).
  • Figure 3: Reinforcement learning network architecture. The network is an end-to-end network mapping state representations to action probability outputs.
  • Figure 4: Examples of Virtual to Real Image Translation. Odd columns are virtual images captured from TORCS. Even columns are synthetic real world images corresponding to virtual images on the left.
  • Figure 5: Transfer learning between different environments. Oracle was trained in Cgtrack2 and tested in Cgtrack2, so its performance is the best. Our model works better than the domain randomization RL method. Domain randomization method requires training in multiple virtual environments, which imposes significant manual engineering work.