Car-STAGE: Automated framework for large-scale high-dimensional simulated time-series data generation based on user-defined criteria
Asma A. Almutairi, David J. LeBlanc, Arpan Kusari
TL;DR
Car-STAGE introduces a GUI-driven framework built on CARLA to generate large-scale, synchronized multi-sensor time-series data with ground-truth annotations. It replaces the native single-thread CARLA workflow with a synchronous, multi-threaded pipeline and memory-mapped I/O, enabling background data collection, deterministic timing, and scalable throughput. Key contributions include the STAGE-VO visibility annotation algorithm, a 12-module architecture, and empirical speedups over CARLA across frames, cameras, and LiDARs. The approach has practical impact for autonomous driving research by simplifying large-scale data generation with consistent ground-truth labeling and enabling cloud-based storage and analysis.
Abstract
Generating large-scale sensing datasets through photo-realistic simulation is an important part of many robotics applications, such as autonomous driving. In this paper, we consider the problem of synchronous data collection from the open-source CARLA simulator using multiple sensors attached to a vehicle, based on user-defined criteria. We propose a novel, one-step framework built on the CARLA simulator, which we refer to as Car-STAGE, that generates data without further user intervention once the configuration parameters have been defined through a graphical user interface (GUI). The framework uses the user-defined configuration parameters, such as the choice of maps, the number and configuration of sensors, and the environmental and lighting conditions, to run the simulation in the background. It collects high-dimensional data from diverse sensors (RGB camera, LiDAR, radar, depth camera, IMU, GNSS, semantic segmentation camera, instance segmentation camera, and optical flow camera) along with the ground truth of the individual actors, and stores both the sensor data and the ground-truth labels in a local or cloud-based database. The framework is multi-threaded: a main thread runs the server, a worker thread manages the queue and frame number, and the remaining threads process the sensor data. We obtain a further speedup over the native implementation by memory-mapping the raw binary data to disk and converting it to known formats only at the end of data collection. We show that these techniques yield a significant speedup over the native implementation as the number of frames, the number of sensors, and the number of spawned objects increase.
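The two speedup techniques named in the abstract, threaded sensor processing and memory-mapped raw storage with deferred conversion, can be illustrated with a minimal Python sketch. This is not the Car-STAGE implementation: it uses only the standard library and no CARLA calls, a single writer thread stands in for the sensor-processing threads, and the names FRAME_BYTES, NUM_FRAMES, fake_sensor_frame, writer, collect, and convert are hypothetical, introduced only for illustration.

```python
import mmap
import os
import queue
import tempfile
import threading

FRAME_BYTES = 16  # hypothetical fixed size of one raw sensor frame
NUM_FRAMES = 8    # hypothetical number of simulation ticks


def fake_sensor_frame(frame_id: int) -> bytes:
    """Stand-in for a raw sensor buffer (assumption: a real run gets this from the simulator)."""
    return bytes([frame_id % 256]) * FRAME_BYTES


def writer(frames: queue.Queue, buf: mmap.mmap) -> None:
    """Sensor thread: copy each raw frame into its fixed slot in the memory-mapped file."""
    while True:
        item = frames.get()
        if item is None:  # sentinel: no more frames will arrive
            break
        frame_id, raw = item
        start = frame_id * FRAME_BYTES
        buf[start:start + FRAME_BYTES] = raw


def collect(path: str) -> None:
    """Collection phase: the main loop only enqueues frames; the thread writes raw bytes."""
    # Pre-size the file so that every frame has a fixed slot.
    with open(path, "wb") as f:
        f.truncate(NUM_FRAMES * FRAME_BYTES)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as buf:
        frames = queue.Queue()
        thread = threading.Thread(target=writer, args=(frames, buf))
        thread.start()
        for frame_id in range(NUM_FRAMES):  # one iteration per simulation tick
            frames.put((frame_id, fake_sensor_frame(frame_id)))
        frames.put(None)
        thread.join()
        buf.flush()  # make sure the mapped pages reach the file


def convert(path: str) -> list:
    """Deferred conversion: slice the raw file into frames only after collection ends."""
    with open(path, "rb") as f:
        data = f.read()
    return [data[i:i + FRAME_BYTES] for i in range(0, len(data), FRAME_BYTES)]


raw_path = os.path.join(tempfile.mkdtemp(), "sensor.raw")
collect(raw_path)
decoded = convert(raw_path)
```

The design point this sketch tries to capture is that the per-tick loop does no format conversion and no per-frame file creation; it hands raw bytes to a thread that copies them into a pre-sized mapping, and any decoding into image or point-cloud formats would happen once in a step like convert after the run.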
