Frequency-Adaptive Low-Latency Object Detection Using Events and Frames

Haitian Zhang; Xiangyuan Wang; Chang Xu; Xinya Wang; Fang Xu; Huai Yu; Lei Yu; Wen Yang

TL;DR

FAOD tackles high-speed object detection by fusing Event cameras with RGB frames while addressing two mismatches: Event-RGB sampling differences and train–test frequency gaps. The method introduces an Align Module that uses AdaIN and deformable convolution to align cross-modal features and a Time Shift training strategy to align predictions from temporally shifted Event–RGB pairs with their original annotations, enabling Event-first predictions with RGB as supplementary information. A shallow EF Fusion with LSTM-based temporal blocks preserves low latency and achieves state-of-the-art results on PKU-DAVIS-SOD and DSEC-Detection, including robustness to 80x frequency mismatch. The work demonstrates that Event–RGB fusion with proper alignment and frequency adaptation generalizes better to high-frequency inference than Event-only approaches, and shows that shallow fusion is more efficient than deep fusion.
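As a rough illustration of the AdaIN step mentioned above, the sketch below applies the standard adaptive instance normalization formula, AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y), channel by channel. It is not the FAOD implementation: treating RGB features as the content and Event features as the style is an assumption drawn from the statement that RGB is aligned to Events, and the names `adain`, `rgb_feat`, and `event_feat` are hypothetical. The deformable-convolution part of the Align Module is not shown.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Re-normalize each content channel to the matching style channel's statistics.

    content, style: lists of channels, each channel a flat list of floats
    (a flattened H x W feature map). Channel counts are assumed to match.
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        # Per-channel mean and (population) standard deviation.
        c_mu, c_sigma = fmean(c_ch), pstdev(c_ch) + eps  # eps avoids division by zero
        s_mu, s_sigma = fmean(s_ch), pstdev(s_ch)
        # Whiten the content channel, then rescale and shift to the style statistics.
        out.append([s_sigma * (v - c_mu) / c_sigma + s_mu for v in c_ch])
    return out


# Toy single-channel features (assumption: RGB is the content, Events are the style).
rgb_feat = [[1.0, 2.0, 3.0, 4.0]]
event_feat = [[10.0, 20.0, 30.0, 40.0]]
aligned = adain(rgb_feat, event_feat)
```

After the call, each channel of `aligned` keeps the spatial pattern of `rgb_feat` but carries the mean and spread of `event_feat`, which is the sense in which the two modalities share a feature "style".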

Abstract

Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments and the rich semantic information provided by RGB cameras. However, two critical mismatches significantly hinder high-frequency fusion-based object detection: low-latency Events vs. high-latency RGB frames, and temporally sparse labels in training vs. continuous flow in inference. To address these challenges, we propose the Frequency-Adaptive Low-Latency Object Detector (FAOD). FAOD aligns low-frequency RGB frames with high-frequency Events through an Align Module, which reinforces cross-modal style and spatial proximity to address the Event-RGB mismatch. We further propose a training strategy, Time Shift, which forces the module to align predictions from temporally shifted Event-RGB pairs with those from their original representation, that is, to remain consistent with the Event-aligned annotations. This strategy enables the network to use high-frequency Event data as the primary reference while treating low-frequency RGB images as supplementary information, retaining the low-latency nature of the Event stream for high-frequency detection. Furthermore, we observe that these corrected Event-RGB pairs generalize better from a low training frequency to higher inference frequencies than Event data alone. Extensive experiments on the PKU-DAVIS-SOD and DSEC-Detection datasets demonstrate that FAOD achieves state-of-the-art performance. Specifically, on the PKU-DAVIS-SOD dataset, FAOD achieves a 9.8-point mAP improvement on fully paired Event-RGB data with only a quarter of the parameters of SODFormer, and maintains robust performance (only a 3-point drop in mAP) under an 80× Event-RGB frequency mismatch.
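One way to picture the Time Shift idea is as a sampling rule for training pairs. The sketch below is an assumption-laden illustration, not the authors' procedure: it supposes that each Event step keeps its own annotation while the paired RGB frame is drawn from a randomly chosen earlier step, so the network sees stale RGB input against Event-aligned labels. The function name `time_shift_pairs` and the parameters `num_steps` and `max_shift` are hypothetical.

```python
import random


def time_shift_pairs(num_steps, max_shift, rng):
    """Return (event_idx, rgb_idx, label_idx) triples for one training sequence.

    Assumption: labels stay aligned with the Event stream, and the RGB frame
    lags the Events by a random number of steps between 0 and max_shift.
    """
    pairs = []
    for t in range(num_steps):
        shift = rng.randint(0, max_shift)  # inclusive on both ends
        rgb_t = max(t - shift, 0)          # clamp at the start of the sequence
        pairs.append((t, rgb_t, t))        # label index follows the Event index
    return pairs


# Seeded generator so the toy sampling is reproducible.
pairs = time_shift_pairs(num_steps=6, max_shift=3, rng=random.Random(0))
```

Under this reading, the supervision never moves with the RGB frame, which matches the description of Events as the primary reference and RGB as supplementary information.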
