Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection

Mehmet Kerem Turkcan, Sanjeev Narasimhan, Chengbo Zang, Gyung Hyun Je, Bo Yu, Mahshid Ghasemi, Javad Ghaderi, Gil Zussman, Zoran Kostic

TL;DR

The paper tackles high-altitude object detection at a dense urban intersection by introducing Constellation, a 13,314-frame dataset captured from a high-elevation camera, transformed to a top-down view, and annotated with bounding boxes for pedestrians and vehicles. It evaluates state-of-the-art detectors (notably YOLO-based models) and investigates how pretraining (VisDrone, CARLA, COCO), domain-specific augmentations, and semi-supervised labeling (BoxMask) affect performance. It also includes a detailed analysis of performance drift across time periods as intersection conditions change. Key findings show that VisDrone pretraining and targeted augmentations improve pedestrian AP, and that pseudo-labeled data can be competitive with external datasets. However, a notable gap remains between pedestrian and vehicle detection, and the observed drift highlights the need for continual data collection and model updating. The best reported results are a pedestrian AP of 92.0% and an mAP of 95.4%. The authors release the datasets, code, and baseline models to support further research on safe, real-time high-altitude urban perception, with applications to safety warnings and traffic analytics in smart cities.
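The pseudo-labeling idea mentioned above (reusing a strong model's detections as training labels) can be sketched as a confidence filter that writes YOLO-format label lines. This is a generic illustration, not the paper's BoxMask procedure: the `Detection` record, the `CLASS_IDS` mapping, and the 0.5 confidence threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical class-id mapping for illustration; the dataset's real label map may differ.
CLASS_IDS = {"pedestrian": 0, "vehicle": 1}


@dataclass
class Detection:
    """A hypothetical detector output: class name, confidence, and pixel box (x1, y1, x2, y2)."""
    label: str
    score: float
    x1: float
    y1: float
    x2: float
    y2: float


def to_pseudo_labels(detections, img_w, img_h, threshold=0.5):
    """Keep detections with score >= threshold and return YOLO-format lines:
    '<class_id> <x_center> <y_center> <width> <height>', normalized to [0, 1].
    Raises KeyError for labels missing from CLASS_IDS."""
    lines = []
    for d in detections:
        if d.score < threshold:
            continue  # drop low-confidence predictions to limit label noise
        # Clip the box to the image bounds before normalizing.
        x1 = min(max(d.x1, 0.0), img_w)
        y1 = min(max(d.y1, 0.0), img_h)
        x2 = min(max(d.x2, 0.0), img_w)
        y2 = min(max(d.y2, 0.0), img_h)
        w, h = x2 - x1, y2 - y1
        if w <= 0 or h <= 0:
            continue  # skip boxes that collapse after clipping
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        lines.append(
            f"{CLASS_IDS[d.label]} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}"
        )
    return lines
```

The threshold trades label coverage against label noise: a higher value keeps fewer but more reliable boxes, which matters most for small pedestrians where confidences tend to be lower.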

Abstract

We introduce Constellation, a dataset of 13K images suitable for research on object detection in dense urban streetscapes observed from high-elevation cameras, collected under a variety of temporal conditions. The dataset addresses the need for curated data to explore problems in small object detection, exemplified by the limited pixel footprint of pedestrians observed from tens of meters above. It enables testing object detection models under variations in lighting, building shadows, weather, and scene dynamics. We evaluate contemporary object detection architectures on the dataset and observe that state-of-the-art methods detect small pedestrians less accurately than vehicles, with a 10% gap in average precision (AP). Pretraining the models on structurally similar datasets increases mean AP (mAP) by 1.8%. We further find that incorporating domain-specific data augmentations improves model performance. Training with pseudo-labeled data, obtained from the inference outputs of the best-performing models, also improves performance. Finally, comparing models trained on data collected in two different time intervals, we find a performance drift caused by changes in intersection conditions over time. The best-performing model achieves a pedestrian AP of 92.0% and an mAP of 95.4%, with an inference time of 11.5 ms on NVIDIA A100 GPUs.
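Because the comparisons above are stated in AP and mAP, a minimal single-class AP computation may help make the metric concrete. This sketch assumes VOC-style greedy matching at an IoU threshold of 0.5 and all-point interpolation; the summary does not state the paper's exact evaluation protocol (for example, COCO-style averaging over IoU thresholds 0.50:0.95), so `iou`, `average_precision`, and their defaults are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """Single-class AP with all-point interpolation.

    preds: list of (image_id, score, box); gts: dict image_id -> list of boxes.
    Each prediction (highest score first) is compared with the ground-truth box
    it overlaps most; it is a true positive only if that IoU >= iou_thr and the
    box is not already matched (VOC-style), otherwise a false positive.
    """
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    tp_flags = []
    for img, _score, box in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts.get(img, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)

    # Cumulative precision and recall after each ranked prediction.
    recalls, precisions = [], []
    tp = fp = 0
    for flag in tp_flags:
        tp += flag
        fp += 1 - flag
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))

    # Make precision monotonically non-increasing (the precision envelope),
    # then integrate it over recall.
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))
```

Under this sketch, mAP would be the mean of the per-class APs, here over the pedestrian and vehicle classes, so a large pedestrian-versus-vehicle AP gap pulls mAP below the vehicle AP.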

Paper Structure

This paper contains 13 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Constellation contains different scenes, with changing time-of-day, weather conditions and background elements for the same camera. (a-d) Different weather and time-of-day conditions; (e-h) changes to the scene background.
  • Figure 2: Examples of two complex crowded frames from Constellation.
  • Figure 3: Different data sources used for experiments: (a) Constellation; (b) CARLA; (c) Stanford Drone Dataset; (d) VisDrone.
  • Figure 4: Comparison of model performance for (a) YOLOv8x and (b) YOLOv8n under different pretraining schemes. Performance for the two architectures follows similar trends, with VisDrone pretraining achieving better results for both.