Transtreaming: Adaptive Delay-aware Transformer for Real-time Streaming Perception

Xiang Zhang, Yufei Cui, Chenchen Fu, Weiwei Wu, Zihao Wang, Yuyang Sun, Xue Liu

TL;DR

The core innovation of Transtreaming lies in its adaptive delay-aware transformer, which can concurrently predict multiple future frames and select the output that best matches the real-world present time, compensating for any system-induced computational delays.

Abstract

Real-time object detection is critical to decision-making in many real-world applications, such as collision avoidance and path planning in autonomous driving. This work presents a real-time streaming perception method, Transtreaming, which addresses the challenge of real-time object detection under dynamic computational delay. The core innovation of Transtreaming is its adaptive delay-aware transformer, which concurrently predicts multiple future frames and selects the output that best matches the real-world present time, compensating for any system-induced computational delay. By leveraging a transformer-based methodology, the proposed model outperforms existing state-of-the-art methods, even in single-frame detection scenarios. It demonstrates robust performance across a range of devices, from the powerful V100 to the modest 2080Ti, achieving the highest perception accuracy on all platforms. Unlike most state-of-the-art methods, which struggle to complete computation within a single frame on less powerful devices, Transtreaming meets the stringent real-time processing requirements on all tested devices. The experimental results highlight the system's adaptability and its potential to significantly improve the safety and reliability of many real-world systems, such as autonomous driving.
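
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the model returns one set of detections per future frame offset and that the "best match" is the offset nearest to the delay actually measured. The names `Detection`, `select_output`, and `frame_interval`, as well as the nearest-offset rule, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    """Hypothetical detection record; the box format is an assumption."""
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def select_output(
    predictions: Dict[int, List[Detection]],
    start_time: float,
    now: float,
    frame_interval: float,
) -> int:
    """Return the predicted frame offset closest to the elapsed delay.

    predictions maps a future frame offset (1 = next frame) to the
    detections forecast for that frame. The elapsed time is converted
    to frames, and the nearest available offset is chosen (an assumed
    rule, used here only for illustration).
    """
    elapsed_frames = (now - start_time) / frame_interval
    return min(predictions, key=lambda offset: abs(offset - elapsed_frames))


# Example: a 30 FPS stream where inference took 70 ms, i.e. about 2.1 frames.
forecasts = {1: [], 2: [Detection("car", (10.0, 20.0, 50.0, 60.0))], 3: []}
chosen = select_output(forecasts, start_time=0.0, now=0.07, frame_interval=1 / 30)
print(chosen, forecasts[chosen])
```

Because the forecasts for several offsets are produced in one pass, the choice can be deferred until the computation has finished and the actual delay is known.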

Paper Structure (13 sections, 1 equation, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Fluctuation of the computational delay of a single detector processing different frames under varying workloads, on a server with an NVIDIA GeForce RTX 4080.
  • Figure 2: Observation example: when a detector is trained with a fixed vehicle speed ($30\,\mathrm{km/h}$) and a fixed computational delay ($\Delta t_1$), detection is always accurate as long as both the speed and the delay remain the same. Detection accuracy decreases significantly, however, when either (a) the vehicle speed changes (from $30\,\mathrm{km/h}$ to $60\,\mathrm{km/h}$) or (b) the computational delay varies (from $\Delta t_1$ to $\Delta t_2$).
  • Figure 3: Overall architecture of Transtreaming. Transtreaming consists of a detection model, Transtreamer, and a strategy algorithm, Adaptive Strategy. Detection is performed mainly by Transtreamer, shown in the lower part of the figure. Adaptive Strategy supports it by producing temporal proposals, which are encoded by RTPE (Relative Temporal Positional Embedding), and by using two buffers to store historical features and dispatch detection results.
  • Figure 4: Detailed architecture of the detection model, Transtreamer. The example illustrates the case in which frame $T-3$ is skipped due to computational delay, and Adaptive Strategy asks Transtreamer to produce detection results for $T+1$ and $T+3$ from the $T+0$ image and the buffered features of $T-1$, $T-2$, and $T-4$.
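
The buffering situation in the Figure 4 caption can be sketched as follows. This is only an illustration of how skipped frames lead to irregular relative offsets such as $T-1$, $T-2$, $T-4$. The class `FeatureBuffer`, its capacity, and its methods are hypothetical, and the way the paper's Adaptive Strategy actually manages its buffers is not specified in this section.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class FeatureBuffer:
    """Hypothetical fixed-size buffer of (frame index, features) pairs."""

    def __init__(self, capacity: int = 3) -> None:
        # The oldest entry is dropped automatically once capacity is reached.
        self._items: Deque[Tuple[int, Any]] = deque(maxlen=capacity)

    def push(self, frame_index: int, features: Any) -> None:
        """Store the features of a frame that was actually processed."""
        self._items.append((frame_index, features))

    def relative_offsets(self, current_index: int) -> List[int]:
        """Offsets of buffered frames relative to the current frame, newest first."""
        return [idx - current_index for idx, _ in reversed(self._items)]


# Example matching the caption: with T = 10, frame 7 (T-3) was skipped,
# so only frames 6, 8, and 9 were processed and buffered.
buffer = FeatureBuffer(capacity=3)
for processed_index in (6, 8, 9):
    buffer.push(processed_index, features=None)  # placeholder features

offsets = buffer.relative_offsets(current_index=10)
print(offsets)  # [-1, -2, -4]
```

Offsets of this kind are what a relative temporal encoding would need as input, since the gaps between buffered frames are no longer uniform once a frame is skipped.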