CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection

Xiang Zhang, Chenchen Fu, Yufei Cui, Lan Yi, Yuyang Sun, Weiwei Wu, Xue Liu

TL;DR

CorrDiff tackles real-time streaming perception under variable delays by using runtime temporal cues to predict multiple future frames and by scheduling model execution so that outputs align with real-world timing. The method combines CDdetector (with the Corr_Past and Diff_Now modules) and CDscheduler (a Planner plus buffers) to adapt to fluctuating communication and computation delays, and it employs mixed-speed training to widen temporal perception. On Argoverse-HD, CorrDiff achieves state-of-the-art streaming performance, with sAP up to 38.1% and notable gains across devices and delay scenarios, demonstrating robustness and practical applicability for safety-critical systems. The work offers open-source code and highlights the importance of delay-aware, multi-frame forecasting for reliable real-time perception in autonomous platforms.

Abstract

Real-time object detection plays an essential role in the decision-making process of numerous real-world applications, including collision avoidance and path planning in autonomous driving systems. This paper presents a novel real-time streaming perception method named CorrDiff, designed to tackle the challenge of delays in real-time detection systems. The main contribution of CorrDiff lies in its adaptive delay-aware detector, which uses runtime-estimated temporal cues to predict objects' locations for multiple future frames and selectively produces the predictions that match real-world time, effectively compensating for communication and computational delays. By leveraging motion estimation and feature enhancement, the proposed model outperforms current state-of-the-art methods both for 1) single-frame detection for the current frame or the next frame, in terms of mAP, and 2) prediction for one or more future frames, in terms of sAP (a metric that evaluates object detection algorithms in streaming scenarios, factoring in both latency and accuracy). It demonstrates robust performance across a range of devices, from the powerful Tesla V100 to the modest RTX 2080Ti, achieving the highest perceptual accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame on less powerful devices, CorrDiff meets the stringent real-time processing requirements on all of these devices. The experimental results highlight the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving. Our code is fully open-sourced and available at https://anonymous.4open.science/r/CorrDiff.
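
To make the idea of a streaming metric concrete, the following is a minimal sketch of the matching step that streaming evaluation is generally built on: each ground-truth timestamp is paired with the most recent prediction that had already finished by that time, so a slow detector is scored on stale output. The function name `match_streaming` and the millisecond timestamps are illustrative assumptions, not taken from the CorrDiff code base, and the sketch covers only the pairing, not the AP computation that follows it.

```python
from bisect import bisect_right

def match_streaming(gt_times_ms, pred_done_times_ms):
    """Pair each ground-truth timestamp with the latest finished prediction.

    gt_times_ms: timestamps (ms) at which ground truth is queried.
    pred_done_times_ms: sorted timestamps (ms) at which predictions finished.
    Returns, per ground-truth timestamp, the index of the matched prediction,
    or None if no prediction had finished yet.
    """
    matched = []
    for t in gt_times_ms:
        # Index of the last prediction finished at or before time t.
        k = bisect_right(pred_done_times_ms, t) - 1
        matched.append(k if k >= 0 else None)
    return matched

# Hypothetical numbers: frames roughly every 33 ms, one result every 50 ms.
print(match_streaming([0, 33, 67, 100, 133], [50, 100, 150]))
```

In this toy stream the first two frames have no prediction available at all, and later frames are evaluated against results computed from older inputs, which is the gap that forecasting future frames is meant to close.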

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Demonstration of object displacement error in real-world systems with potential communication-computational delays.
  • Figure 2: One-frame inference delay of DAMO-StreamNet in different scenarios: (a) deployed on different devices; (b) deployed on a server with an RTX 4080 under various workloads (simulated using an approach similar to the GPU contention generation in ApproxDet); (c) deployed on a server with an RTX 4080 under different bandwidths.
  • Figure 3: Motivating example: when a streaming detector is trained with a fixed communication-computational delay (e.g., $\Delta t$) and fixed object velocities (e.g., $30\,\mathrm{km/h}$), inference can be accurate when (a) both delay and velocity remain the same as in training. Detection accuracy decreases significantly, however, when either (b) the communication-computational delay varies (e.g., from $\Delta t$ to $\Delta t'$) or (c) the vehicle velocity changes (e.g., from $30\,\mathrm{km/h}$ to $60\,\mathrm{km/h}$).
  • Figure 4: Overall architecture of CorrDiff. CorrDiff consists of a detection model, CDdetector, and a scheduling algorithm, CDscheduler. CDdetector uses the Corr_Past module and the Diff_Now module, combining past and current features to produce future predictions. CDscheduler supports it by gathering runtime statistics to generate Temporal Cues, which are processed by CDdetector, making it adaptively delay-aware. The scheduler also uses three buffers: the Historical Feature Buffer to reuse previously computed frame features, the Corr_Past Buffer to reuse correlation results, and the Output Buffer to store the freshest predictions and dispatch detection results at the corresponding timestamp. "F&C Buffer" abbreviates the Historical Feature Buffer and the Corr_Past Buffer.
  • Figure 5: Detailed architecture of the detection model CDdetector. The example illustrates the case in which frame $i-3$ is skipped due to communication or computational delay. The detector is required to generate predictions for frames $i+1$ and $i+3$, given the image at $i$ and buffered features at $i-1$, $i-2$, and $i-4$.
  • ...and 2 more figures
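
The captions for Figures 4 and 5 describe two mechanisms that a short sketch can illustrate: choosing how many frames ahead to predict from a runtime delay estimate, and an output buffer that keeps the freshest prediction per target frame and releases it at the matching timestamp. The code below is a simplified illustration under stated assumptions, not the paper's Planner or buffer implementation: `frames_ahead`, `OutputBuffer`, the ceiling rule, and the 33 ms frame interval are all hypothetical choices made for this example.

```python
import math

def frames_ahead(delay_ms, frame_interval_ms):
    """Assumed rule: number of frame intervals elapsed before a result is ready.

    Rounds up, and never returns less than 1, so the prediction always
    targets a frame that has not yet arrived when the result is emitted.
    """
    return max(1, math.ceil(delay_ms / frame_interval_ms))

class OutputBuffer:
    """Hypothetical buffer keeping the freshest prediction per target frame."""

    def __init__(self):
        self._by_frame = {}  # target frame index -> (source frame, prediction)

    def put(self, target_frame, source_frame, prediction):
        current = self._by_frame.get(target_frame)
        # Prefer the prediction computed from the most recent input frame.
        if current is None or source_frame >= current[0]:
            self._by_frame[target_frame] = (source_frame, prediction)

    def dispatch(self, frame):
        """Return and remove the prediction for `frame`, or None if absent."""
        entry = self._by_frame.pop(frame, None)
        return None if entry is None else entry[1]

# Usage with hypothetical numbers: an 80 ms delay at a 33 ms frame interval.
i = 10
k = frames_ahead(80, 33)          # three intervals elapse, so target frame i + 3
buf = OutputBuffer()
buf.put(i + k, i, "boxes predicted from frame 10")
print(k, buf.dispatch(i + k))
```

Keying the buffer by target frame rather than by arrival order means a late, staler result cannot overwrite a fresher one for the same timestamp.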