UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

Karthikeyan Chandra Sekaran, Markus Geisler, Dominik Rößle, Adithya Mohan, Daniel Cremers, Wolfgang Utschick, Michael Botsch, Werner Huber, Torsten Schön

TL;DR

UrbanIng-V2X addresses the lack of large-scale real-world benchmarks for cooperative perception across multiple urban intersections by introducing a multi-vehicle, multi-infrastructure dataset collected at three intersections in Ingolstadt, Germany. The dataset comprises 34 sequences of 20 seconds each, recorded with two vehicles and up to three infrastructure poles. In total it covers 12 vehicle RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs, and it is annotated at 10 Hz with approximately 712k 3D boxes across 13 classes. An HD map and a CARLA-based digital twin accompany the recordings. The dataset provides precise sensor synchronization, calibration, per-point LiDAR motion compensation, and annotations with tracking IDs and attributes to support 3D object detection, tracking, trajectory prediction, and localization. Benchmark results reveal a generalization gap when models are evaluated on unseen intersections, underscoring the need for robust cross-site perception methods. The authors release the code, dataset, HD map, and digital twin to foster research in perception, tracking, prediction, and simulation.

Abstract

Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.
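The headline numbers in the abstract can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: it assumes every sequence yields exactly 20 s × 10 Hz = 200 annotated timestamps and treats "approximately 712k" as 712,000, so the derived values are rough estimates rather than figures reported by the authors. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
NUM_SEQUENCES = 34
SEQUENCE_SECONDS = 20
ANNOTATION_HZ = 10
APPROX_BOXES = 712_000  # assumption: "approximately 712k" taken as 712,000

# Sensor counts quoted in the abstract (totals across the whole dataset).
SENSORS = {
    "vehicle_rgb_cameras": 12,
    "vehicle_lidars": 2,
    "infrastructure_thermal_cameras": 17,
    "infrastructure_lidars": 12,
}


def dataset_summary():
    """Derive rough size estimates from the quoted figures."""
    # Assumption: each sequence is annotated at exactly 10 Hz for 20 s.
    frames_per_sequence = SEQUENCE_SECONDS * ANNOTATION_HZ
    total_frames = NUM_SEQUENCES * frames_per_sequence
    return {
        "frames_per_sequence": frames_per_sequence,
        "total_frames": total_frames,
        "total_seconds": NUM_SEQUENCES * SEQUENCE_SECONDS,
        "boxes_per_frame": APPROX_BOXES / total_frames,
        "total_sensors": sum(SENSORS.values()),
    }


if __name__ == "__main__":
    for key, value in dataset_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions, the dataset spans roughly 6,800 annotated cooperative timestamps (about 11 minutes of recording), with on the order of 100 annotated boxes per timestamp.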

Paper Structure

This paper contains 32 sections, 24 figures, 6 tables.

Figures (24)

  • Figure 1: This illustration provides a comprehensive overview of the UrbanIng-V2X cooperative perception dataset environment. For each intersection, a globally fused point cloud of a representative scenario is visualized. Point clouds from individual agents are color-coded, highlighting two vehicles and sensor poles at three intersections as cooperation partners. Further, the complete sensor setup, along with a bird's-eye view of both the HD map and a high-fidelity CARLA map, is shown.
  • Figure 2: Sensor setup and coordinate frame. The left figure shows details of one vehicle, and the right figure shows details of one pole of a crossing. GC describes the geometric center.
  • Figure 3: Result of the spatially calibrated and temporally aligned multi-modal sensor sources. The point cloud image highlights the time deviation in a globally fused cooperative LiDAR frame, particularly critical when LiDARs of multiple agents are capturing the same object. The top row shows the overlaid projections of the point cloud into three exemplary sensor perspectives.
  • Figure 4: Projection of 3D annotations at one timestamp into three exemplary views: the front left camera (left), the bird’s-eye view of the fused point cloud (center), and two infrastructure cameras (right).
  • Figure 5: Trajectories projected onto the HD map of each intersection, color-coded by object category, illustrating the quality, density, and variation across the intersection layouts. In total, 2156 trajectories from Intersection 1, 1895 trajectories from Intersection 2, and 835 trajectories from Intersection 3 are shown.
  • ...and 19 more figures