A Multi-Drone Multi-View Dataset and Deep Learning Framework for Pedestrian Detection and Tracking

Kosta Dakic, Kanchana Thilakarathna, Rodrigo N. Calheiros, Teng Joon Lim

TL;DR

This work tackles pedestrian detection and tracking in urban environments using moving drones, addressing the limitations of static-camera benchmarks. It introduces MATRIX, an eight-drone synchronized dataset with real-time camera calibration and a BEV-based multi-view framework that fuses dynamic views to maintain robust detection and tracking. Key contributions include a dynamic camera calibration system, a BEV feature fusion pipeline, and comprehensive evaluation showing strong performance under occlusion and camera motion, plus transfer learning and dropout analyses that demonstrate generalization and robustness. The MATRIX dataset and framework provide a rigorous benchmark and a practical pathway toward robust, scalable multi-drone surveillance in real-world scenarios.

Abstract

Multi-drone surveillance systems offer enhanced coverage and robustness for pedestrian tracking, yet existing approaches struggle with dynamic camera positions and complex occlusions. This paper introduces MATRIX (Multi-Aerial TRacking In compleX environments), a comprehensive dataset featuring synchronized footage from eight drones with continuously changing positions, and a novel deep learning framework for multi-view detection and tracking. Unlike existing datasets that rely on static cameras or limited drone coverage, MATRIX provides a challenging scenario with 40 pedestrians and a significant architectural obstruction in an urban environment. Our framework addresses the unique challenges of dynamic drone-based surveillance through real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird's-eye-view (BEV) representation. Experimental results demonstrate that while static-camera methods maintain over 90\% detection and tracking precision and accuracy in a simplified MATRIX environment (no obstruction, 10 pedestrians, and a much smaller observation area), their performance degrades significantly in the complex environment. Our proposed approach maintains robust performance, with $\sim$90\% detection and tracking accuracy, and successfully tracks $\sim$80\% of trajectories under challenging conditions. Transfer learning experiments reveal strong generalization capabilities, with the pretrained model achieving much higher detection and tracking accuracy than a model trained from scratch. Additionally, systematic camera dropout experiments reveal graceful performance degradation, demonstrating practical robustness for real-world deployments where camera failures may occur. The MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems.
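The abstract mentions camera calibration and fusion in a BEV representation. As a rough illustration of how a calibrated view relates to the ground plane, the sketch below builds the standard planar homography from intrinsics `K` and extrinsics `R`, `t`, then back-projects a pixel onto the plane Z = 0. The function names, the camera values, and the flat-ground assumption are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ground_homography(K, R, t):
    """Homography mapping ground-plane points (X, Y, 1), with Z = 0, to pixels."""
    # For Z = 0 the third column of R drops out of the projection K [R | t].
    return K @ np.column_stack((R[:, 0], R[:, 1], t))

def pixel_to_ground(H, uv):
    """Back-project one pixel (u, v) onto the Z = 0 ground plane."""
    p = np.linalg.solve(H, np.array([uv[0], uv[1], 1.0]))
    return p[:2] / p[2]

# Hypothetical nadir-looking camera hovering 10 m above the ground origin.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])   # camera z-axis points down at the ground
t = np.array([0.0, 0.0, 10.0])   # ground origin lies 10 m along the camera z-axis

H = ground_homography(K, R, t)
ground_xy = pixel_to_ground(H, (400.0, 80.0))
```

With a moving drone, `R` and `t` change every frame, so a homography of this kind would have to be recomputed per frame from the current calibration.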

Paper Structure

This paper contains 31 sections, 14 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Sample synchronized snapshots from all eight drones at a single timestep in the simple MATRIX dataset. This scenario features 10 pedestrians in an open 15$\times$15 m environment with minimal occlusions and calibration checkerboards (checkered patterns) visible for dynamic camera calibration, providing a baseline for evaluating multi-drone tracking under favorable conditions.
  • Figure 2: Sample synchronized snapshots from all eight drones at a single timestep in the complex MATRIX dataset. This scenario features 40 pedestrians in a 30$\times$30 m environment with a large central architectural obstruction (dark column) creating significant occlusions and dense crowding conditions that challenge tracking robustness. Note the varying perspectives and overlapping coverage areas across drone views.
  • Figure 3: An illustration of the Unreal Engine environment, where (a) is the smaller environment and (b) is the more complex test environment with occlusion.
  • Figure 4: Illustration of the multi-view detection and tracking pipeline. The system processes synchronized frames from multiple drone cameras through two parallel streams: a reference stream that maintains a stable view representation, and a current stream that processes incoming frames. Each stream undergoes perspective transformation to BEV representation, followed by image registration that aligns the current view with the reference using feature matching and homography estimation. The registered multi-view features are then fused spatially and temporally before being decoded into detection heatmaps and tracking predictions. This dual-stream architecture enables the system to maintain spatial consistency despite continuous drone movement while preserving temporal coherence for robust tracking.
  • Figure 5: Visualization of the image registration process, where (a) is the reference image, (b) is the original image from the batch, and (c) is the output after image alignment.
  • ...and 7 more figures
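The captions for Figures 4 and 5 describe registering the current view to a reference view through feature matching and homography estimation. The sketch below illustrates only the homography step, using a plain direct linear transform (DLT) on point pairs that are assumed to be already matched. The function names and sample coordinates are hypothetical. Robust estimation (for example RANSAC) and coordinate normalization are omitted, and the paper's actual registration method may differ.

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT estimate of H such that dst ~ H @ src, from at least 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H's 9 entries.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the scale; assumes H[2, 2] is not zero

def warp_points(H, pts):
    """Apply a homography to an (N, 2) array of points."""
    pts_h = np.column_stack((pts, np.ones(len(pts))))
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical matched keypoints: positions in the current frame and in the reference.
H_true = np.array([[1.0, 0.02, 5.0],
                   [-0.01, 1.0, -3.0],
                   [1e-4, 0.0, 1.0]])
current = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 80.0], [0.0, 80.0], [50.0, 30.0]])
reference = warp_points(H_true, current)

H_est = estimate_homography(current, reference)
aligned = warp_points(H_est, current)  # current-frame points expressed in the reference frame
```

On real matches containing outliers, a least-squares fit like this one is easily skewed, which is why practical registration pipelines usually wrap the estimate in a robust scheme.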