Event-to-Video Conversion for Overhead Object Detection

Darryl Hannan, Ragib Arnab, Gavin Parpart, Garrett T. Kenyon, Edward Kim, Yijing Watkins

TL;DR

This work addresses overhead object detection with event cameras, where dense event representations underperform RGB frames, in part because they overlap poorly with the data used to pre-train the detectors. The authors demonstrate that converting event streams to gray-scale video using FireNet and FlowNet significantly narrows the gap by aligning the input with pre-trained RGB models, enabling strong detectors (e.g., Cascade RCNN, YOLOv8, DINO) to operate effectively on reconstructed frames. On the overhead task, this approach outperforms even event-specific detectors, which highlights the value of representation alignment over solely pursuing end-to-end event-specific architectures. The findings suggest that leveraging large pre-trained models through event-to-video conversion offers practical, near-term gains for event-based overhead detection, and that future research should pursue better cross-domain representation alignment.

Abstract

Collecting overhead imagery using an event camera is desirable due to the energy efficiency of the image sensor compared to standard cameras. However, event cameras complicate downstream image processing, especially for complex tasks such as object detection. In this paper, we investigate the viability of event streams for overhead object detection. We demonstrate that across a number of standard modeling approaches, there is a significant gap in performance between dense event representations and corresponding RGB frames. We establish that this gap is, in part, due to a lack of overlap between the event representations and the pre-training data used to initialize the weights of the object detectors. Then, we apply event-to-video conversion models that convert event streams into gray-scale video to close this gap. We demonstrate that this approach results in a large performance increase, outperforming even event-specific object detection techniques on our overhead target task. These results suggest that better alignment between event representations and existing large pre-trained models may result in greater short-term performance gains compared to end-to-end event-specific architectural improvements.

Paper Structure

This paper contains 9 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Comparison of the same VisDrone-VID scene using various input representations. Top Left: Event Count Map. Top Right: FireNet Gray-scale Frame. Bottom: Original RGB Frame.
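
For readers unfamiliar with the "Event Count Map" named in Figure 1, the following is a minimal sketch of one common way to build such a dense representation: count the events that fall on each pixel within a time window, with one channel per polarity. The event layout `(x, y, timestamp, polarity)`, the function name `event_count_map`, and the window parameters are assumptions made for illustration and are not taken from the paper. It is written in plain Python for clarity; a practical pipeline would likely use array operations instead.

```python
from typing import Iterable, List, Tuple

# Assumed event layout: (x, y, timestamp in seconds, polarity as +1 or -1).
Event = Tuple[int, int, float, int]


def event_count_map(
    events: Iterable[Event],
    height: int,
    width: int,
    t_start: float,
    t_end: float,
) -> List[List[List[int]]]:
    """Count events per pixel and polarity inside the window [t_start, t_end).

    Returns a nested list shaped [2][height][width]:
    channel 0 holds negative-polarity counts, channel 1 positive-polarity counts.
    """
    counts = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, p in events:
        # Skip events outside the time window.
        if not (t_start <= t < t_end):
            continue
        # Skip events outside the sensor bounds.
        if not (0 <= x < width and 0 <= y < height):
            continue
        channel = 1 if p > 0 else 0
        counts[channel][y][x] += 1
    return counts


# Hypothetical toy stream on a 2x2 sensor; the last event falls outside the window.
events = [(0, 0, 0.001, 1), (0, 0, 0.002, 1), (1, 1, 0.003, -1), (1, 0, 0.050, 1)]
cmap = event_count_map(events, height=2, width=2, t_start=0.0, t_end=0.010)
total = sum(sum(sum(row) for row in channel) for channel in cmap)
```

A map like this records where activity occurred but not scene intensity, which is one intuition for why it may differ visually from the RGB images that detectors are typically pre-trained on.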