Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?

Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, Ram Vasudevan

TL;DR

The paper tackles the annotation bottleneck in autonomous-driving vision by training object detectors on photo-realistic synthetic data generated with a GTA V-based pipeline. It shows that a detector (Faster R-CNN with VGG-16) trained solely on synthetically labeled images can outperform the same architecture trained on human-annotated real data when both are evaluated on KITTI, especially as the volume of synthetic data increases. The authors describe their data-capture and bounding-box refinement workflow in detail, analyze dataset bias, and discuss the broader implications for scalable training without human labeling in sensor-based recognition. The work suggests that large-scale synthetic data, when properly annotated, can accelerate deep learning applications in perception without extensive human labeling. Overall, the study points to a promising path toward domain generalization and faster development cycles in self-driving perception systems.

Abstract

Deep learning has rapidly transformed the state-of-the-art algorithms used to address a variety of problems in computer vision and robotics. These breakthroughs have relied upon massive amounts of human-annotated training data. This time-consuming process has begun impeding the progress of these deep learning efforts. This paper describes a method to incorporate photo-realistic computer images from a simulation engine to rapidly generate annotated data that can be used for the training of machine learning algorithms. We demonstrate that a state-of-the-art architecture, which is trained only using these synthetic annotations, performs better than the identical architecture trained on human-annotated real-world data, when tested on the KITTI data set for vehicle detection. By training machine learning algorithms on a rich virtual world, real objects in real scenes can be learned and classified using synthetic data. This approach offers the possibility of accelerating deep learning's application to sensor-based classification problems like those that appear in self-driving cars. The source code and data to train and validate the networks described in this paper are made available for researchers.

Paper Structure

This paper contains 12 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Sample images captured from the video game based simulation engine proposed in this paper. A range of different times of day are simulated including day, night, morning and dusk. Additionally the engine captures complex weather and lighting scenarios such as driving into the sun, fog, rain and haze.
  • Figure 2: Four different weather types appear in the training data. The simulation can be paused and the weather condition can be varied. Additionally, note the depth buffer and object stencil buffer used for annotation capture. In the depth image, the darker the intensity, the farther the object is from the camera. In the stencil buffer, we have artificially applied colors to the image's discrete values, which correspond to different object labels for the game. Note that these values cannot be used directly. The process by which these are interpreted is highlighted in Section \ref{s:buffers}.
  • Figure 3: An illustration of the pipeline for tight bounding box creation. The engine's original bounding boxes are shown in (a). Since they are loose, we process them before using them as training data. In (b) we see an image from the stencil buffer; the orange pixels have been marked as vehicle, enabling us to produce tight contours, outlined in green. However, note that the two objects do not receive independent IDs in this pass, so we must disambiguate the pixels from the truck and the compact car in a subsequent step. To do this we use the depth shown in (c), where lighter purple indicates closer range. This map is used to help separate the two vehicles, where (e) contains updated contours after processing using depth and (f) contains those same updated contours in the depth frame. Finally, (d) depicts the bounding boxes with the additional small vehicle detections in blue, which are all used for training. Full details of the process appear in Section \ref{s:bbox}.
  • Figure 4: Heatmaps of the training data's bounding box centroids. These plots show the frequency of cars in different locations in the image. Note the much larger spread of occurrence locations in the simulated data (b) than in the real images of Cityscapes (a). In the proposed approach, cars are found across a wide area of the image, aiding the network in capturing the diversity of real appearance.
  • Figure 5: These figures depict histograms of the number of detections per frame in the Cityscapes, Simulation, and KITTI data sets. Note the similarity of the simulation and KITTI data set distributions. This may aid the network trained using the simulation data set when evaluated on the KITTI data set.
  • ...and 3 more figures
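
The Figure 3 caption describes taking vehicle pixels from the stencil buffer and using the depth buffer to separate vehicles that share one stencil label. The following is a minimal sketch of that idea, not the authors' implementation: it swaps in a simple depth-aware flood fill, where neighbouring vehicle pixels are grouped only if their depths are close. The function name `tight_boxes`, the parameter `max_depth_jump`, and the toy arrays are all hypothetical, and depth is assumed to be a plain numeric distance per pixel.

```python
from collections import deque

import numpy as np


def tight_boxes(vehicle_mask, depth, max_depth_jump=1.0):
    """Return tight (x_min, y_min, x_max, y_max) boxes, one per blob of
    vehicle pixels whose neighbouring depths differ by at most max_depth_jump.

    Illustrative stand-in for the paper's refinement step (assumption).
    """
    height, width = vehicle_mask.shape
    seen = np.zeros((height, width), dtype=bool)
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not vehicle_mask[y0, x0] or seen[y0, x0]:
                continue
            # Breadth-first flood fill starting from an unvisited vehicle pixel.
            queue = deque([(y0, x0)])
            seen[y0, x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if not vehicle_mask[ny, nx] or seen[ny, nx]:
                        continue
                    # A large depth jump marks the edge between two vehicles.
                    if abs(depth[ny, nx] - depth[y, x]) > max_depth_jump:
                        continue
                    seen[ny, nx] = True
                    queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy scene (made-up values): a far "truck" partly occluded by a near "car".
mask = np.zeros((6, 8), dtype=bool)
depth = np.full((6, 8), 100.0)
mask[1:5, 1:5] = True
depth[1:5, 1:5] = 20.0  # truck pixels
mask[3:5, 4:7] = True
depth[3:5, 4:7] = 5.0   # car pixels overwrite the occluded truck corner

print(tight_boxes(mask, depth))
# [(1, 1, 4, 4), (4, 3, 6, 4)]
print(tight_boxes(mask, depth, max_depth_jump=float("inf")))
# [(1, 1, 6, 4)]
```

With the depth test disabled, the two touching vehicles collapse into a single box, which is the ambiguity the caption attributes to the stencil pass alone. A real pipeline would more likely use a contour or connected-component routine from an image-processing library in place of the pure-Python loop.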