Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection

Hu Cao; Zehua Zhang; Yan Xia; Xinyi Li; Jiahao Xia; Guang Chen; Alois Knoll

TL;DR

This work proposes a novel hierarchical feature refinement network for event-frame fusion, which surpasses the state of the art by an impressive margin and is significantly more robust when 15 different corruption types are applied to the frame images.

Abstract

In frame-based vision, object detection suffers substantial performance degradation under challenging conditions because of the limited sensing capability of conventional cameras. Event cameras output sparse and asynchronous events, offering a potential solution to these problems. However, effectively fusing the two heterogeneous modalities remains an open issue. In this work, we propose a novel hierarchical feature refinement network for event-frame fusion. The core concept is a coarse-to-fine fusion module, denoted the cross-modality adaptive feature refinement (CAFR) module. In the initial phase, the bidirectional cross-modality interaction (BCI) part bridges information between the two distinct sources. Subsequently, the features are further refined by aligning the channel-level mean and variance in the two-fold adaptive feature refinement (TAFR) part. We conducted extensive experiments on two benchmarks: the low-resolution PKU-DDD17-Car dataset and the high-resolution DSEC dataset. Experimental results show that our method surpasses the state of the art by a margin of 8.0% on the DSEC dataset. In addition, our method exhibits significantly better robustness (69.5% versus 38.7%) when 15 different corruption types are introduced to the frame images. The code is available at https://github.com/HuCaoFighting/FRN.
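The abstract states that TAFR refines features by aligning channel-level mean and variance. As a minimal illustration of that general idea only, the sketch below re-normalizes each channel of one feature map to the statistics of the matching channel of another, in the style of adaptive instance normalization. The function names, the choice of which modality serves as the reference, and the toy values are assumptions made for illustration; the actual TAFR design is defined in the paper and the linked repository, not here.

```python
import math

def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of a flat list of activations for one channel."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return mean, math.sqrt(var + eps)

def align_channel_stats(source, reference, eps=1e-5):
    """Re-normalize each source channel to the mean/std of the matching reference channel.

    source, reference: lists of channels, each channel a flat list of floats.
    Hypothetical helper; not taken from the paper's code.
    """
    aligned = []
    for src_ch, ref_ch in zip(source, reference):
        src_mean, src_std = channel_stats(src_ch, eps)
        ref_mean, ref_std = channel_stats(ref_ch, eps)
        # Standardize with the source statistics, then rescale and shift
        # to the reference statistics.
        aligned.append([(v - src_mean) / src_std * ref_std + ref_mean for v in src_ch])
    return aligned

# Toy features with 2 channels and 4 spatial positions each (made-up values).
frame_feat = [[0.0, 2.0, 4.0, 6.0], [1.0, 1.5, 2.0, 2.5]]
event_feat = [[10.0, 10.0, 14.0, 14.0], [-1.0, 0.0, 0.0, 1.0]]
refined = align_channel_stats(frame_feat, event_feat)
```

After this step each channel of `refined` keeps the spatial pattern of the frame feature but carries the per-channel mean and (approximately) the standard deviation of the event feature.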

Paper Structure (17 sections, 13 equations, 12 figures, 10 tables)

Introduction
Related Work
Method
Preliminaries
Hierarchical Feature Refinement Network
Experiments
Datasets
Evaluation Metrics
Ablation Study
Comparison with SOTA Methods
Conclusion
More Experimental Details
Datasets
Training Details
More Experimental Analysis
...and 2 more sections

Figures (12)

Figure 1: This work leverages the complementary information from both events and frames for object detection. (Left) In each image pair, the left image is from frames, while the right is from events. Note that event cameras excel at high-speed and high-dynamic-range sensing but struggle to capture static and remote small targets compared to RGB cameras. (Right) We choose three methods, FPN-Fusion [FPN_Fusion], RENet [zhou2023rgb], and EFNet [sun2022event], for performance evaluation.
Figure 2: Feature maps of RGB and event modalities before and after CAFR. The first row corresponds to the day scene, and the last row represents the night scene.
Figure 3: The overall architecture of our hierarchical feature refinement network. It comprises a dual-stream backbone network, CAFR, FPN, and a detection head. The backbone incorporates two branches: the event-based ResNet-50 (bottom) and the frame-based ResNet-50 [resnet] (top). The CAFR enhances features at each level of the hierarchy. The refined multi-scale features are then forwarded to the FPN and detection head for accurate detection predictions. The structure of the FPN and the detection head is adapted from RetinaNet.
Figure 4: Cross-modality adaptive feature refinement module (CAFR). It contains two integral parts: bidirectional cross-modality interaction (BCI) and two-fold adaptive feature refinement (TAFR). Here, "FR" denotes feature refinement.
Figure 5: Different network architecture designs for the fusion module. It includes: (a) a single branch consisting of only frame-dominated CrossAtt and FR; and (b) a single branch consisting of only event-dominated CrossAtt and FR.
...and 7 more figures
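
The captions of Figures 4 and 5 refer to frame-dominated and event-dominated cross-attention inside the BCI part. The following is a rough sketch of what a bidirectional cross-modality interaction could look like, assuming plain scaled dot-product attention over flattened feature tokens. Learned query/key/value projections are omitted, the residual addition is an assumption, and all names and values are hypothetical rather than taken from the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Scaled dot-product attention: each query token attends over the context tokens.

    queries: list of N tokens, context: list of M tokens, each token a list of d floats.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        # Weighted sum of context tokens, one output coordinate at a time.
        out.append([sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(d)])
    return out

def bidirectional_interaction(frame_tokens, event_tokens):
    """Run cross-attention in both directions and add a residual connection (assumed)."""
    frame_from_event = cross_attention(frame_tokens, event_tokens)  # frame queries
    event_from_frame = cross_attention(event_tokens, frame_tokens)  # event queries
    frame_out = [[a + b for a, b in zip(t, u)] for t, u in zip(frame_tokens, frame_from_event)]
    event_out = [[a + b for a, b in zip(t, u)] for t, u in zip(event_tokens, event_from_frame)]
    return frame_out, event_out

# Toy tokens: 3 frame positions and 2 event positions, feature dimension 2.
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
event_tokens = [[0.5, 0.5], [2.0, 0.0]]
frame_out, event_out = bidirectional_interaction(frame_tokens, event_tokens)
```

Each branch keeps the token count of its own modality while mixing in content from the other, which is why the two directions can be computed independently and fed to a later refinement step.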
