CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang

Abstract

The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems is creating growing demand for simulation infrastructure that can jointly model aerial and ground agents within a single, physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatiotemporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves the native Python APIs and ROS 2 interfaces of both CARLA and AirSim, so existing client code can be reused without modification. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially aware pedestrians, and aerodynamically consistent UAV dynamics, and it synchronously captures up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning air-ground cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows custom robot platforms to be integrated into the shared world. By inheriting the aerial capabilities of AirSim, whose upstream development has been archived, CARLA-Air ensures that this widely adopted flight stack continues to evolve within a modern infrastructure. CARLA-Air is released with prebuilt binaries and full source code: https://github.com/louiszengCN/CarlaAir
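
A central claim of the abstract is that ground and aerial agents advance on one shared physics tick, so every sensor reading in a frame refers to the same world state. The toy Python model below illustrates that idea only. It is not CARLA-Air's implementation: the classes `Agent` and `SharedWorld`, the constant-velocity dynamics, and the 0.05 s step are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Agent:
    """Hypothetical stand-in for a ground vehicle or multirotor actor."""
    name: str
    velocity: Tuple[float, float, float]  # constant velocity in m/s (toy dynamics)
    sensors: List[str] = field(default_factory=list)
    position: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def step(self, dt: float) -> None:
        # Explicit Euler update; stands in for each backend's real physics.
        for i in range(3):
            self.position[i] += self.velocity[i] * dt


@dataclass
class SharedWorld:
    """Toy single-process world: one clock drives every agent and sensor."""
    dt: float = 0.05  # fixed step in seconds (assumed value)
    frame: int = 0
    agents: List[Agent] = field(default_factory=list)

    def tick(self) -> Dict[str, dict]:
        # 1) Advance all agents, ground and aerial, by the same dt.
        for agent in self.agents:
            agent.step(self.dt)
        self.frame += 1
        # 2) Read every sensor after the update, stamped with the same frame,
        #    so no cross-process timestamp matching is needed.
        return {
            f"{agent.name}/{sensor}": {"frame": self.frame, "pose": tuple(agent.position)}
            for agent in self.agents
            for sensor in agent.sensors
        }


if __name__ == "__main__":
    world = SharedWorld(agents=[
        Agent("car", (10.0, 0.0, 0.0), ["rgb", "lidar"]),
        Agent("drone", (0.0, 0.0, 2.0), ["rgb", "depth", "imu"]),
    ])
    for _ in range(2):
        obs = world.tick()
    print(sorted({reading["frame"] for reading in obs.values()}))  # [2]
```

By contrast, a bridge-based setup runs each simulator on its own clock and must serialize and align readings across processes after the fact, which is the synchronization overhead the abstract attributes to co-simulation.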

Paper Structure

This paper contains 64 sections, 5 equations, 13 figures, 12 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Per-frame inter-process data transfer time as a function of the number of concurrent sensors. Bridge-based co-simulation [transimhub] exhibits near-linear growth with sensor count due to cross-process serialization, while CARLA-Air remains effectively constant ($<0.5$ ms) owing to its single-process architecture.
  • Figure 2: Platform positioning along simulation fidelity and agent domain breadth. CARLA-Air ($\star$) occupies the high-fidelity, multi-domain quadrant without inter-process bridging. Dashed arrows indicate subsumption of upstream capabilities.
  • Figure 3: Runtime architecture of CARLA-Air. A single engine process hosts both simulation backends, each communicating with its respective Python client via an independent RPC server. CARLAAirGameMode acquires ground simulation functionality through class inheritance and integrates the aerial flight actor through composition. All world actors share a single rendering pipeline.
  • Figure 4: Resolving the UE4 single-game-mode constraint. (a) Both backends provide independent game mode classes; assigning either silently discards the other. (b) CARLAAirGameMode inherits all ground functionality from CARLA's game mode base while composing the aerial flight actor as a spawned world actor.
  • Figure 5: Coordinate frames of the two simulation backends. The transform $T$ requires only a Z-axis sign flip and a centimeter-to-meter scale factor; the forward ($X$/$X_N$) and rightward ($Y$/$Y_E$) axes are aligned across both conventions.
  • ...and 8 more figures
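
The caption of Figure 5 states that mapping between the two backends' coordinate frames needs only a Z-axis sign flip and a centimeter-to-meter scale, with the forward and rightward axes already aligned. A minimal Python sketch of that mapping follows. It assumes that ground-side positions are given in raw Unreal world units (centimeters, Z up), as the caption's scale factor implies, and that both frames share one origin. The function names `ue_to_ned` and `ned_to_ue` are hypothetical, and a real deployment may additionally need a translation between origins.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

CM_PER_M = 100.0  # Unreal world units are centimeters


def ue_to_ned(p_ue_cm: Vec3) -> Vec3:
    """Map an Unreal/CARLA-frame point (cm; X forward, Y right, Z up) to NED meters.

    Assumes a shared origin (hypothetical); X and Y are only rescaled,
    while Z is rescaled and sign-flipped because NED's third axis points down.
    """
    x, y, z = p_ue_cm
    return (x / CM_PER_M, y / CM_PER_M, -z / CM_PER_M)


def ned_to_ue(p_ned_m: Vec3) -> Vec3:
    """Inverse of ue_to_ned: NED meters back to Unreal centimeters with Z up."""
    n, e, d = p_ned_m
    return (n * CM_PER_M, e * CM_PER_M, -d * CM_PER_M)


print(ue_to_ned((1000.0, -250.0, 300.0)))  # (10.0, -2.5, -3.0)
```

A point 3 m above the ground plane in the Unreal frame thus has a negative down-coordinate in NED. Because the transform is a diagonal scaling with a single sign flip, its inverse is equally cheap, which is consistent with the caption's observation that no axis permutation is required.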