Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Haitian Zhang, Xiangyuan Wang, Chang Xu, Xinya Wang, Fang Xu, Huai Yu, Lei Yu, Wen Yang

TL;DR

FAOD tackles high-speed object detection by fusing Event cameras with RGB frames while addressing two mismatches: Event-RGB sampling differences and train–test frequency gaps. The method introduces an Align Module that uses AdaIN and deformable convolution to align cross-modal features and a Time Shift training strategy to align predictions from temporally shifted Event–RGB pairs with their original annotations, enabling Event-first predictions with RGB as supplementary information. A shallow EF Fusion with LSTM-based temporal blocks preserves low latency and achieves state-of-the-art results on PKU-DAVIS-SOD and DSEC-Detection, including robustness to 80x frequency mismatch. The work demonstrates that Event–RGB fusion with proper alignment and frequency adaptation generalizes better to high-frequency inference than Event-only approaches, and shows that shallow fusion is more efficient than deep fusion.
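As a rough illustration of the style-alignment step named above, the following is a minimal NumPy sketch of the standard AdaIN operation: one feature map is normalized per channel and then rescaled to the channel statistics of another. It assumes that RGB features are restyled toward Event feature statistics; this direction, the function name `adain`, and the random feature shapes are illustrative assumptions, not details taken from the paper. The deformable-convolution part of the Align Module is not covered.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Standard AdaIN on feature maps of shape (C, H, W).

    Normalizes `content` per channel, then rescales it to the
    per-channel mean and standard deviation of `style`.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

# Hypothetical feature maps standing in for the two modalities.
rng = np.random.default_rng(0)
rgb_feat = rng.normal(2.0, 3.0, size=(4, 8, 8))     # assumed "content"
event_feat = rng.normal(-1.0, 0.5, size=(4, 8, 8))  # assumed "style"

aligned = adain(rgb_feat, event_feat)
```

After the call, each channel of `aligned` keeps the spatial layout of `rgb_feat` but carries the mean and spread of the corresponding `event_feat` channel, which is the sense in which AdaIN brings two feature distributions closer in style.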

Abstract

Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches significantly hinder high-frequency fusion-based object detection: low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous flow in inference. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB mismatch. We further propose a training strategy, Time Shift, which trains the module to align predictions from temporally shifted Event-RGB pairs with those from their original representation, that is, to stay consistent with the Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream for high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs generalize better from a low training frequency to higher inference frequencies than Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that FAOD achieves state-of-the-art performance. Specifically, on the PKU-DAVIS-SOD dataset, FAOD improves mAP by 9.8 points on fully paired Event-RGB data with only a quarter of the parameters of SODFormer, and maintains robust performance (only a 3-point drop in mAP) under an 80× Event-RGB frequency mismatch.
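To make the Event-RGB frequency mismatch concrete, here is a small sketch of how each high-frequency Event frame might be paired with the most recent low-frequency RGB frame. The pairing rule, the function name `pair_events_with_rgb`, and the timestamps are hypothetical and chosen only for illustration; the paper's actual sampling procedure may differ.

```python
from bisect import bisect_right

def pair_events_with_rgb(event_times, rgb_times):
    """For each Event-frame timestamp, return the index of the latest
    RGB frame captured at or before it (None if no RGB frame exists yet).

    Both inputs are assumed to be sorted in ascending order.
    """
    pairs = []
    for t in event_times:
        idx = bisect_right(rgb_times, t) - 1
        pairs.append(idx if idx >= 0 else None)
    return pairs

# Hypothetical timestamps in milliseconds:
# Event frames every 5 ms, RGB frames every 40 ms (an 8x mismatch).
event_times = list(range(0, 100, 5))
rgb_times = [0, 40, 80]

pairs = pair_events_with_rgb(event_times, rgb_times)

# How stale the paired RGB frame is for each Event frame, in ms.
staleness = [t - rgb_times[i] for t, i in zip(event_times, pairs)]
```

In this toy setup the same RGB frame is reused for up to eight consecutive Event frames, and the RGB content lags the Events by as much as 35 ms. That growing lag is the kind of misalignment the Align Module and Time Shift training are described as compensating for.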

Paper Structure

This paper contains 25 sections, 4 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: FAOD is specifically designed for high-speed object detection using Events and RGB Frames. The top figure demonstrates its ability to handle Event and RGB data with varying frequencies, a primary challenge hindering high-frequency fusion-based object detection. The bottom figure shows that the proposed approach enables the memory network to better generalize from low training frequency to high inference frequencies, unlike Event-based methods (event_detector_2023_CVPR, event_detector_2024_SSM), which are typically restricted to a fixed frequency.
  • Figure 2: Detection Performance vs. Model Parameters. FAOD achieves the best detection performance in the PKU-DAVIS-SOD dataset with the fewest model parameters among the methods using both Events and RGB images.
  • Figure 3: Overview of the proposed FAOD. (a) FAOD follows a hierarchical structure where unpaired Event frames and RGB images are fused only in the first stage. The fused features are then propagated through the subsequent three (b) CSPDark-LSTM Blocks for multi-scale feature extraction. The (c) Align Module ensures proper alignment between Event and RGB data before fusion in the (d) EF Fusion module. An LSTM Cell is embedded in each stage to capture temporal dependencies.
  • Figure 4: Deep feature fusion framework. Our deep-level fusion network comprises four stages, each incorporating Event and RGB modalities and the fused features from the preceding stage. The final fused features are fed into an FPN and detection head for object detection.
  • Figure 5: Calculation of Aligned Loss. Aligned Loss is computed between RGB frames that have passed through the Align Module under unpaired conditions and RGB frames that are perfectly aligned with the Event frames. This loss serves as supervision for the Align Module.
  • ...and 2 more figures