Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion

Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, Fei-Yue Wang

TL;DR

This work introduces bio-inspired event cameras and proposes a novel Structure-aware Fusion Network (SFNet) that extracts sharp and complete object structures from the event stream to compensate for the lost information in images through cross-modality fusion, enabling the network to obtain illumination-robust representations for traffic object detection.

Abstract

Traffic object detection under variable illumination is challenging due to the information loss caused by the limited dynamic range of conventional frame-based cameras. To address this issue, we introduce bio-inspired event cameras and propose a novel Structure-aware Fusion Network (SFNet) that extracts sharp and complete object structures from the event stream to compensate for the lost information in images through cross-modality fusion, enabling the network to obtain illumination-robust representations for traffic object detection. Specifically, to mitigate the sparsity or blurriness issues arising from diverse motion states of traffic objects in fixed-interval event sampling methods, we propose the Reliable Structure Generation Network (RSGNet) to generate Speed Invariant Frames (SIF), ensuring the integrity and sharpness of object structures. Next, we design a novel Adaptive Feature Complement Module (AFCM) which guides the adaptive fusion of two modality features to compensate for the information loss in the images by perceiving the global lightness distribution of the images, thereby generating illumination-robust representations. Finally, considering the lack of large-scale and high-quality annotations in the existing event-based object detection datasets, we build a DSEC-Det dataset, which consists of 53 sequences with 63,931 images and more than 208,000 labels for 8 classes. Extensive experimental results demonstrate that our proposed SFNet can overcome the perceptual boundaries of conventional cameras and outperform the frame-based method by 8.0% in mAP50 and 5.9% in mAP50:95. Our code and dataset will be available at https://github.com/YN-Yang/SFNet.
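The idea of letting the global lightness of the image steer how much event information is blended in can be illustrated with a toy example. The sketch below is not the authors' AFCM, which is a learned module; it is a hand-written stand-in under stated assumptions. The function names, the exposure thresholds `low` and `high`, and the additive fusion rule are all hypothetical. Lightness is approximated with the Rec. 601 luma weights, and pixel values are assumed to lie in [0, 1].

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # (r, g, b), each assumed to be in [0, 1]


def lightness(pixel: Pixel) -> float:
    """Approximate lightness with the Rec. 601 luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def badly_exposed_fraction(image: Sequence[Sequence[Pixel]],
                           low: float = 0.1,
                           high: float = 0.9) -> float:
    """Fraction of pixels that are under- or over-exposed.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    pixels = [p for row in image for p in row]
    bad = sum(1 for p in pixels if not (low <= lightness(p) <= high))
    return bad / len(pixels)


def fuse(image_feat: Sequence[float],
         event_feat: Sequence[float],
         gate: float) -> List[float]:
    """Add event features to image features, scaled by a gate in [0, 1]."""
    return [f_i + gate * f_e for f_i, f_e in zip(image_feat, event_feat)]


if __name__ == "__main__":
    dark_image = [[(0.02, 0.02, 0.02)] * 4 for _ in range(4)]
    normal_image = [[(0.5, 0.5, 0.5)] * 4 for _ in range(4)]
    image_feat, event_feat = [1.0, 2.0], [0.5, 0.5]
    for name, img in (("dark", dark_image), ("normal", normal_image)):
        gate = badly_exposed_fraction(img)
        print(name, gate, fuse(image_feat, event_feat, gate))
```

In this toy setting a uniformly dark image opens the gate fully, so the event features contribute at full strength, while a well-exposed image leaves the image features unchanged. A learned module would replace the hand-set thresholds with attention weights estimated from data.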

Paper Structure (17 sections, 16 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Our SFNet comprises two cascaded steps: reliable structure generation and adaptive feature complement. Left: Given an RGB image and corresponding event stream, high-quality structural information is generated from the event stream. Right: The event feature is adaptively extracted to compensate for information loss in the image modality. The image feature, before complementation, lacks discriminative characteristics due to information loss and contrast reduction across the image. Notably, after the complementation, the feature becomes more discriminative.
  • Figure 2: Fixed time windows for event representation. The nearby pedestrian moves faster than the car in the distance. (b) displays a short time window of $\Delta t = 10\,\mathrm{ms}$, resulting in sparse structures for the car. Conversely, (c) shows a long time window of $\Delta t = 50\,\mathrm{ms}$, resulting in blurry structures for the pedestrian. Notably, our Speed Invariant Frame in (d) shows complete structures and sharp edges for both the pedestrian and the car. Red and pink boxes mark the car and the pedestrian, respectively.
  • Figure 3: The pipeline of the proposed Structure-aware Fusion Network (SFNet). The event frame $F$ and event polarity integration $E$ generated from the event stream are concatenated and fed into the Reliable Structure Generation Network (RSGNet; see Section \ref{RSGNet}) to generate the Speed Invariant Frame (SIF). The RGB image and the SIF are then fed into two independent CSPDarkNets to extract modality-specific features. The Adaptive Feature Complement Module (AFCM; see Section \ref{AFCM}), which consists of an Event Refine Module (ERM) and a Lightness Distribution-aware Attention Module (LDAM), is inserted after the Conv and layer1 stages to fuse the two modalities. FPN+PANet further fuses the features at different resolutions. Finally, the decoder outputs the class and bounding box of each detected object.
  • Figure 4: Overview of our DSEC-Det dataset, which contains extremely challenging variable lighting conditions and rich object annotations.
  • Figure 5: Proportion of annotated bounding boxes for the DSEC-Det dataset.
  • ...and 6 more figures
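
The trade-off described in the Figure 2 caption, where a short fixed window leaves slow objects sparse and a long one smears fast objects, can be reproduced with a small synthetic example. The sketch below is only an illustration of fixed-interval event accumulation, not the paper's RSGNet or SIF. The event model (one event each time an edge crosses a new pixel), the speeds, and all function names are assumptions made for this example. Timestamps are integer microseconds.

```python
from typing import List, Tuple

Event = Tuple[int, int, int, int]  # (x, y, timestamp in microseconds, polarity)


def edge_events(speed_px_per_s: int, row: int,
                duration_us: int, width: int) -> List[Event]:
    """Synthetic events from one edge moving right along `row`.

    Assumed model: the edge emits one positive event each time it
    reaches a new pixel column.
    """
    events = []
    k = 1
    while k < width and k * 1_000_000 // speed_px_per_s <= duration_us:
        events.append((k, row, k * 1_000_000 // speed_px_per_s, 1))
        k += 1
    return events


def accumulate_events(events: List[Event], t_end_us: int, window_us: int,
                      width: int, height: int) -> List[List[int]]:
    """Count events per pixel within (t_end_us - window_us, t_end_us]."""
    frame = [[0] * width for _ in range(height)]
    t_start_us = t_end_us - window_us
    for x, y, t_us, _polarity in events:
        if t_start_us < t_us <= t_end_us:
            frame[y][x] += 1
    return frame


def active_in_row(frame: List[List[int]], row: int) -> int:
    """Number of pixels in `row` that received at least one event."""
    return sum(1 for v in frame[row] if v > 0)


if __name__ == "__main__":
    width, height, duration_us = 64, 2, 50_000
    # Row 0: fast edge (stand-in for the nearby pedestrian).
    # Row 1: slow edge (stand-in for the distant car).
    events = (edge_events(1000, 0, duration_us, width)
              + edge_events(40, 1, duration_us, width))
    for window_us in (10_000, 50_000):
        frame = accumulate_events(events, duration_us, window_us,
                                  width, height)
        print(window_us, "us window:",
              "fast row", active_in_row(frame, 0), "px,",
              "slow row", active_in_row(frame, 1), "px")
```

With these assumed speeds, the 10 ms window captures 10 pixels of the fast edge but only 1 pixel of the slow edge, while the 50 ms window recovers 2 pixels of the slow edge at the cost of spreading the fast edge over 50 pixels. No single fixed window suits both motion states, which is the problem the caption attributes to fixed-interval sampling.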