EMF: Event Meta Formers for Event-based Real-time Traffic Object Detection

Muhammad Ahmed Ullah Khan, Abdul Hannan Khan, Andreas Dengel

TL;DR

This work tackles real-time object detection for event-based cameras, addressing the inefficiencies of transformer-heavy backbones by introducing EMF, an Event Meta Former backbone that is tailored to event data. It combines an Event Progression Extractor with convolution-based MetaFormer blocks (RepMixer and ConvFFN) across four stages, augmented by LSTMs to capture temporal context, and uses YOLOX for detection. On Gen1 and 1Mpx benchmarks, EMF achieves state-of-the-art or near-state-of-the-art performance with significantly reduced inference time and parameter count, while demonstrating strong generalization in cross-dataset tests and improved scaling with larger datasets. Overall, EMF delivers a faster, more data-efficient, and more robust solution for event-based traffic object detection, with practical implications for real-time autonomous driving systems.
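
The RepMixer block named above comes from the FastViT line of work, where a skip connection around a depthwise convolution is folded into a single convolution at inference time (structural reparameterization). This summary does not say whether EMF uses exactly that form, so the following is only a minimal sketch of the folding idea under simplifying assumptions: one channel, a 1-D signal, no normalization layer, and hypothetical helper names (`conv1d_same`, `fold_skip_into_kernel`) that do not come from the paper.

```python
# Illustrative sketch only: 1-D, single channel, pure Python.
# Function names are hypothetical and not taken from the EMF paper.

def conv1d_same(signal, kernel):
    """Cross-correlate `signal` with an odd-length `kernel`, zero-padded so
    the output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def mixer_train_form(signal, kernel):
    """Training-time form: identity skip plus a convolution branch."""
    branch = conv1d_same(signal, kernel)
    return [s + b for s, b in zip(signal, branch)]


def fold_skip_into_kernel(kernel):
    """Merge the identity skip into the kernel by adding 1 to the centre tap."""
    merged = list(kernel)
    merged[len(kernel) // 2] += 1.0
    return merged


def mixer_inference_form(signal, kernel):
    """Inference-time form: a single convolution with the folded kernel."""
    return conv1d_same(signal, fold_skip_into_kernel(kernel))


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, -0.5]
train_out = mixer_train_form(signal, kernel)
infer_out = mixer_inference_form(signal, kernel)
print(train_out)
print(infer_out)
```

Both forms produce the same output, but the folded form needs one pass over the data instead of a convolution plus an addition. That kind of saving is consistent with the reduced inference time reported here, although the summary does not attribute the speed-up to any single component.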

Abstract

Event cameras offer higher temporal resolution and require less storage and bandwidth than traditional RGB cameras. However, because event-based approaches still lag in performance, event cameras have not yet replaced traditional cameras in performance-critical applications such as autonomous driving. Recent approaches to event-based object detection try to bridge this gap with computationally expensive transformer-based solutions. Because of their resource-intensive components, these solutions fail to exploit the sparsity and higher temporal resolution of event cameras efficiently. Moreover, they are adopted from the vision domain and lack specificity to event cameras. In this work, we explore efficient and performant alternatives to recurrent vision transformer models and propose a novel event-based object detection backbone. The proposed backbone employs a novel Event Progression Extractor module, tailored specifically to event data, and uses the MetaFormer concept with efficient convolution-based components. We evaluate the resulting model on well-established traffic object detection benchmarks and conduct cross-dataset evaluation to test its ability to generalize. The proposed model outperforms the state of the art on the Prophesee Gen1 dataset by 1.6 mAP while reducing inference time by 14%. Our proposed EMF becomes the fastest DNN-based architecture in the domain, outperforming the most efficient event-based object detectors. Moreover, the proposed model generalizes better to unseen data and scales better as more data becomes available.

Paper Structure

This paper contains 27 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Detailed architecture of the proposed EMF backbone, its components (a-f), and its use in the event-based object detection pipeline (g).
  • Figure 2: Qualitative comparison of the EMF and RVT-B [gehrig2023recurrent] models against ground truth (GT) on the Gen1 and 1Mpx datasets. The cyan bounding boxes represent cars, while the orange bounding boxes represent pedestrians. The first three columns are from the Gen1 dataset, while the last three columns contain samples from the 1Mpx dataset.