TUMTraf V2X Cooperative Perception Dataset

Walter Zimmer; Gerhard Arya Wardana; Suren Sritharan; Xingcheng Zhou; Rui Song; Alois C. Knoll

TL;DR

This work proposes CoopDet3D, a cooperative multi-modal fusion model, and TUMTraf-V2X, a perception dataset, for the cooperative 3D object detection and tracking task. The dataset, model, labeling tool, and devkit are publicly available on the project website.

Abstract

Cooperative perception offers several benefits for enhancing the capabilities of autonomous vehicles and improving road safety. Using roadside sensors in addition to onboard sensors increases reliability and extends the sensor range. External sensors offer higher situational awareness for automated vehicles and prevent occlusions. We propose CoopDet3D, a cooperative multi-modal fusion model, and TUMTraf-V2X, a perception dataset, for the cooperative 3D object detection and tracking task. Our dataset contains 2,000 labeled point clouds and 5,000 labeled images from five roadside and four onboard sensors. It includes 30k 3D boxes with track IDs and precise GPS and IMU data. We labeled eight categories and covered occlusion scenarios with challenging driving maneuvers, like traffic violations, near-miss events, overtaking, and U-turns. Through multiple experiments, we show that our CoopDet3D camera-LiDAR fusion model achieves an increase of +14.36 3D mAP compared to a vehicle camera-LiDAR fusion model. Finally, we make our dataset, model, labeling tool, and dev-kit publicly available on our website: https://tum-traffic-dataset.github.io/tumtraf-v2x.

Paper Structure (37 sections, 9 equations, 31 figures, 8 tables)

Introduction
Related work
Single viewpoint datasets
V2X datasets
V2X perception models for object detection
TUMTraf-V2X Dataset
Sensor setup
Sensor calibration and registration
Data selection and labeling
Data structure and format
Dataset development kit
Benchmark
Evaluation metrics
CoopDet3D model
Experiments and ablation studies
...and 22 more sections

Figures (31)

Figure 1: Visualization of 3D box labels and tracks in our TUMTraf V2X Cooperative Perception Dataset. The top part shows the labels projected into the four camera images. The part below shows a point cloud from two LiDARs with 3D box labels of the same scene.
Figure 2: Our CoopDet3D framework is a multi-modal cooperative fusion system comprising three distinct fusion pipelines. 1) The roadside camera-LiDAR fusion pipeline fuses three camera images and one LiDAR point cloud by extracting features and transforming them into a BEV representation. 2) The vehicle camera-LiDAR fusion pipeline fuses the vehicle camera feature map with the vehicle point cloud feature map using a convolutional fuser. 3) The vehicle and infrastructure feature maps are then fused by applying an element-wise max-pooling operation (Max Fuser). In the end, we use the TransFusion (Bai et al., 2022) 3D detection head to obtain 3D bounding box predictions.
Figure 3: Demonstration of a possible V2X occlusion scenario. A pedestrian (blue) is crossing the road in front of the ego vehicle. An occluded bicycle is marked in red. The recording vehicle with the sensor setup is shown in the bottom left corner.
Figure 4: Our TUMTraf-V2X dataset (version 1.0) contains 25k 3D box labels in total and is balanced among eight different object classes. (a) Cars (11,203) and pedestrians (4,781) are highly represented in the dataset. (b) Each 3D box label contains 590 points on average, which indicates the density of the labeled objects. The BUS class has the highest point density. (c) All traffic participants are tracked for 26 m on average. Buses have the highest average track length of 43 m, whereas the CAR class contains the maximum track length of 173.95 m.
Figure 5: Our dataset was recorded at a crowded intersection with many left and right turns. (a) Most of the vehicles (6,160) are driving in the east direction (0 degrees). (b) 3D boxes were labeled up to a range of 200 m and are very dense between 10 and 60 m. (c) The visualization of BEV tracks shows where pedestrians and bicycles are crossing the road.
...and 26 more figures
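
The Max Fuser step described in the Figure 2 caption (element-wise max-pooling of the vehicle and infrastructure feature maps) can be illustrated with a minimal sketch. This is not the CoopDet3D implementation: the `FeatureMap` alias, the `max_fuse` function, and the toy values are hypothetical, and the sketch assumes both feature maps have already been transformed into the same BEV grid with identical shape.

```python
from typing import List

# Hypothetical type alias: a BEV feature map stored as [channel][row][col].
FeatureMap = List[List[List[float]]]


def max_fuse(vehicle_bev: FeatureMap, infra_bev: FeatureMap) -> FeatureMap:
    """Fuse two aligned BEV feature maps by taking the element-wise maximum."""

    def shape(fm: FeatureMap):
        return (len(fm), len(fm[0]), len(fm[0][0]))

    # Assumption: both maps share one BEV grid, so shapes must match exactly.
    if shape(vehicle_bev) != shape(infra_bev):
        raise ValueError("feature maps must have the same shape")

    return [
        [
            [max(v, i) for v, i in zip(v_row, i_row)]
            for v_row, i_row in zip(v_chan, i_chan)
        ]
        for v_chan, i_chan in zip(vehicle_bev, infra_bev)
    ]


# Toy 1-channel, 2x2 feature maps with made-up values.
vehicle = [[[0.2, 0.9], [0.0, 0.4]]]
infra = [[[0.7, 0.1], [0.3, 0.4]]]
fused = max_fuse(vehicle, infra)
print(fused)  # [[[0.7, 0.9], [0.3, 0.4]]]
```

In each BEV cell the stronger of the two activations survives, so a feature seen only by the roadside sensors (for example, an object occluded from the vehicle) is still present in the fused map. In a tensor library the same operation would typically be a single element-wise maximum call.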