Detecting Every Object from Events

Haitian Zhang, Chang Xu, Xinya Wang, Bingde Liu, Guang Hua, Lei Yu, Wen Yang

TL;DR

This work tackles class-agnostic open-world object detection (CAOD) using event cameras to handle fast-moving objects and challenging illumination. It introduces DEOE, a two-head architecture with a Disentangled Objectness Head and a Dual Regressor Head, leveraging spatio-temporal consistency and potential-sample screening to discover unknown objects in event streams. The approach achieves superior performance over strong baselines across multiple settings and demonstrates strong generalization in cross-dataset tests, while maintaining high inference speeds suitable for real-time perception. This work broadens CAOD to event-based vision and highlights the value of temporal information for open-world object localization in safety-critical applications.

Abstract

Object detection is critical in autonomous driving, and it is more practical yet challenging to localize objects of unknown categories: an endeavour known as Class-Agnostic Object Detection (CAOD). Existing studies on CAOD predominantly rely on ordinary cameras, but these frame-based sensors usually have high latency and limited dynamic range, leading to safety risks in real-world scenarios. In this study, we turn to a new modality enabled by the so-called event camera, characterized by its sub-millisecond latency and high dynamic range, for robust CAOD. We propose Detecting Every Object in Events (DEOE), an approach tailored for achieving high-speed, class-agnostic open-world object detection in event-based vision. Building upon a fast event-based backbone, the recurrent vision transformer, we jointly consider spatial and temporal consistency to identify potential objects. The discovered potential objects are assimilated as soft positive samples to avoid being suppressed as background. Moreover, we introduce a disentangled objectness head to separate the foreground-background classification and novel object discovery tasks, enhancing the model's generalization in localizing novel objects while maintaining a strong ability to filter out the background. Extensive experiments confirm the superiority of our proposed DEOE in comparison with three strong baseline methods that integrate the state-of-the-art event-based object detector with advancements in RGB-based CAOD. Our code is available at https://github.com/Hatins/DEOE.

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: An overview of DEOE (upper) and an illustration of the sampling process in CAOD (lower). The entire model is built upon a fast event-based backbone, the recurrent vision transformer (RVT) [event_detector_2023_CVPR]. Upper: DEOE consists of the Disentangled Objectness Head and the Dual Regressor Head. These two heads perform their tasks while simultaneously providing the metrics "Objectness", "Spatial IoU", and "Temporal IoU", which are used to identify potential object samples. Lower: In object detection, it is common practice to assign anchors that have a high IoU with the ground truth (GT) as positive samples (green dashed boxes), while those with a low IoU are defined as negative samples (red dashed boxes). However, in the context of CAOD, anchors containing unknown objects may lack annotations and be inappropriately treated as negative samples; these are referred to as potential samples (orange boxes). In this example, we assume that only "car" is annotated, whereas "trucks" and "two-wheelers" lack annotations.
  • Figure 2: Approach Overview: DEOE comprises two key components: the Disentangled Objectness Head and the Dual Regressor Head. The former generates the "Objectness" metric, while the latter provides the "Spatial IoU" and "Temporal IoU" metrics. These three metrics work in tandem to detect potential novel objects within the images, subsequently assigning positive samples based on them in the training process. The two branches in the objectness head disentangle the foreground/background division task and the foreground discovery task.
  • Figure 3: Qualitative results on example images from the Four Hour 1 Mpx dataset (5+2 setting). Upper: the detection results of CA-RVT; middle: the detection results of DEOE; bottom: the GT annotations. In the GT annotations, three different colored bounding boxes are employed: the Green Boxes signify annotations for known classes (people, cars, and so on), the Blue Boxes designate annotations for unknown classes (two-wheelers and bus), and the Yellow Boxes represent conspicuously wrong annotations, including false alarms (signified by "W") and missed detections. The white bounding boxes in the first two rows represent the prediction results of the two models. The number at the bottom of each box indicates the category of the detected object.
  • Figure 4: A comparison between event cameras and RGB cameras in extreme scenarios. The white boxes in the image denote the model's detection results for known classes, the Blue Boxes designate unknown classes, and the Yellow Boxes represent instances where the image-based model missed detections compared to the event-based model. Note that the event-based model comes from the previous "5+2" setting, while the image-based model is a pretrained YOLOX model derived from mmdetection.
  • Figure 5: Some typical failure cases of DEOE on the DSEC-Detection dataset. The Red Boxes contain challenging small objects, positioned at a considerable distance and influenced by light. The Yellow Boxes in the middle column denote failed predictions, including false alarms and missed detections. The Green Boxes and Blue Boxes in the right column indicate GT for known and unknown classes, respectively.
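
The sampling process described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: the function names, the thresholds `pos_thr` and `screen_thr`, and the rule that combines objectness, spatial IoU, and temporal IoU (taking their minimum) are all assumptions made for illustration. Only the general idea comes from the caption: anchors with a high IoU against the ground truth are positive, and unannotated anchors that score well on the three metrics are kept as potential samples instead of being treated as negatives.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(
    anchors: Sequence[Box],
    gt_boxes: Sequence[Box],
    objectness: Sequence[float],
    spatial_iou: Sequence[float],
    temporal_iou: Sequence[float],
    pos_thr: float = 0.5,     # hypothetical threshold, not from the paper
    screen_thr: float = 0.7,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Label each anchor as 'positive', 'potential', or 'negative'."""
    labels = []
    for i, anchor in enumerate(anchors):
        # Best overlap with any annotated (known-class) box.
        best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
        if best >= pos_thr:
            labels.append("positive")
        # Assumed screening rule: all three metrics must be high.
        elif min(objectness[i], spatial_iou[i], temporal_iou[i]) >= screen_thr:
            labels.append("potential")
        else:
            labels.append("negative")
    return labels


anchors = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
gt_boxes = [(0, 0, 10, 10)]  # only the "car" is annotated
labels = assign_labels(
    anchors,
    gt_boxes,
    objectness=[0.9, 0.9, 0.1],
    spatial_iou=[0.9, 0.8, 0.2],
    temporal_iou=[0.9, 0.75, 0.3],
)
print(labels)
```

In this toy input, the second anchor overlaps no annotation but scores highly on all three metrics, so it is kept as a potential sample; under a standard assignment it would have been a negative.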