MC-BEVRO: Multi-Camera Bird Eye View Road Occupancy Detection for Traffic Monitoring

Arpitsinh Vaghela, Duo Lu, Aayush Atul Verma, Bharatesh Chakravarthi, Hua Wei, Yezhou Yang

TL;DR

This work tackles occlusion and limited field of view in roadside traffic perception by proposing a BEV road occupancy detection framework that fuses data from multiple roadside cameras. It compares a late fusion baseline with three early fusion methods and further improves performance by integrating background information. The framework is developed and evaluated on a synthetic dataset generated with the CARLA simulator, with experiments on the impact of multi-camera inputs and BEV occupancy map size. Sim-to-real transfer is evaluated on real-world data through zero-shot and few-shot fine-tuning. The contributions include the synthetic multi-camera dataset, multiple fusion strategies, and a real-world data collection pipeline for traffic monitoring.

Abstract

Single-camera 3D perception for traffic monitoring faces significant challenges due to occlusion and limited field of view. Moreover, fusing information from multiple cameras at the image feature level is difficult because of differing view angles. The need for practical implementation and compatibility with existing traffic infrastructure compounds these challenges. To address these issues, this paper introduces a novel Bird's-Eye-View (BEV) road occupancy detection framework that leverages multiple roadside cameras. To facilitate the framework's development and evaluation, a synthetic dataset featuring diverse scenes and varying camera configurations was generated using the CARLA simulator. One late fusion method and three early fusion methods were implemented within the proposed framework, with performance further enhanced by integrating backgrounds. Extensive evaluations were conducted to analyze the impact of multi-camera inputs and varying BEV occupancy map sizes on model performance. Additionally, a real-world data collection pipeline was developed to assess the model's ability to generalize to real-world environments. The sim-to-real capabilities of the model were evaluated using zero-shot and few-shot fine-tuning, demonstrating its potential for practical application. This research aims to advance perception systems in traffic monitoring, contributing to improved traffic management, operational efficiency, and road safety.
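
The late fusion approach mentioned in the abstract is described in the Figure 3 caption as predicting one occupancy map per camera and fusing them by mean aggregation. The following is a minimal sketch of that fusion step only, not the authors' implementation. The function name `late_fusion_mean`, the `(num_cameras, H, W)` layout, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np


def late_fusion_mean(per_camera_maps, threshold=0.5):
    """Fuse per-camera BEV occupancy probability maps by averaging.

    per_camera_maps: array-like of shape (num_cameras, H, W) with values
        in [0, 1] (layout assumed for this sketch).
    threshold: cut-off for the binary map (hypothetical default).
    Returns (fused_probabilities, binary_occupancy).
    """
    maps = np.asarray(per_camera_maps, dtype=np.float64)
    if maps.ndim != 3:
        raise ValueError("expected shape (num_cameras, H, W)")
    fused = maps.mean(axis=0)      # mean aggregation across cameras
    occupied = fused >= threshold  # binarize the fused map
    return fused, occupied


# Two hypothetical cameras predicting a 2x2 BEV grid.
cam_a = [[0.9, 0.1], [0.8, 0.0]]
cam_b = [[0.7, 0.1], [0.0, 0.0]]
fused, occupied = late_fusion_mean([cam_a, cam_b])
```

In this toy input, the bottom-left cell is confidently occupied for one camera (0.8) and empty for the other (0.0), so the averaged value of 0.4 falls below the assumed threshold. A plain mean treats a camera that cannot see a cell the same as one that sees it empty.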

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An overview of BEV road occupancy detection using multiple roadside cameras.
  • Figure 2: Samples from the generated multi-camera occupancy dataset showcasing diverse scenes and perspectives.
  • Figure 3: An overview of the implemented models: (top) the baseline late fusion model predicts per-camera occupancy maps, which are fused via mean aggregation; (middle) the early fusion model fuses projected features using a neural network; (bottom) the integration of background information.
  • Figure 4: Qualitative results of experiments showing (a) predicted occupancy against ground truth for the corresponding multi-camera input, (b) predicted occupancy for the area covered by one camera, comparing input from one camera versus input from all cameras, and (c) finer resolution in the occupancy map with increasing grid size.
  • Figure 5: Real-world data collection. The top row shows the pipeline to get BEV occupancy from aerial images using oriented bounding box detections. The bottom row presents camera setup and predicted BEV occupancy with and without fine-tuning.
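
The Figure 5 caption mentions deriving BEV occupancy from oriented bounding box detections in aerial images. One way that last step could look is sketched below: each box is rasterized onto a grid by marking the cells whose centers fall inside it. This is an illustration under stated assumptions, not the paper's pipeline. The function name `rasterize_obbs`, the `(cx, cy, length, width, angle_rad)` box format, and the use of grid-cell units are all hypothetical, and the detector and any image-to-ground alignment are not shown.

```python
import numpy as np


def rasterize_obbs(boxes, grid_shape):
    """Mark BEV grid cells whose centers lie inside any oriented box.

    boxes: iterable of (cx, cy, length, width, angle_rad), in grid-cell
        units with x along columns and y along rows (assumed convention).
    grid_shape: (H, W) of the output occupancy grid.
    Returns a boolean array of shape (H, W), indexed as occ[row, col].
    """
    h, w = grid_shape
    ys, xs = np.mgrid[0:h, 0:w]
    px = xs + 0.5  # x coordinate of each cell center
    py = ys + 0.5  # y coordinate of each cell center
    occ = np.zeros((h, w), dtype=bool)
    for cx, cy, length, width, angle in boxes:
        dx, dy = px - cx, py - cy
        c, s = np.cos(angle), np.sin(angle)
        u = c * dx + s * dy   # offset along the box's length axis
        v = -s * dx + c * dy  # offset along the box's width axis
        occ |= (np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)
    return occ


# An axis-aligned 2x2 box centered in a 4x4 grid.
occ_square = rasterize_obbs([(2.0, 2.0, 2.0, 2.0, 0.0)], (4, 4))
# A 4x2 box rotated by 90 degrees, so its length runs along the rows.
occ_rotated = rasterize_obbs([(2.0, 2.0, 4.0, 2.0, np.pi / 2)], (4, 4))
```

Testing cell centers keeps the sketch short. A finer grid, or an area-overlap test, would follow box edges more closely.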