SSGA-Net: Stepwise Spatial Global-local Aggregation Networks for Autonomous Driving

Yiming Cui, Cheng Han, Dongfang Liu

TL;DR

This work targets online video object detection for autonomous driving, where feature degradation and the need for real-time inference hinder performance. It introduces a stepwise spatial global-local aggregation network that progressively refines object predictions using a set of neighboring frames $\mathcal{N}(\mathbf{I})$ of size $l$, while fusing global semantics from the current frame with local details from neighbors. Central contributions include a multi-stage stepwise refinement, a spatial global-local fusion module, and a dynamic aggregation strategy that stops when refinements converge (cosine similarity threshold $\delta$). Empirically, the approach yields at least 1% mAP improvement on ImageNet VID and gains on car-driving datasets, with modest extra compute and strong reconfigurability, making it practical for online perception in autonomous driving.
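
The early-stopping idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each object representation is a plain vector, uses a hypothetical `average_fuse` function as a stand-in for the spatial global-local aggregation module, and assumes that "converged" means the cosine similarity between two consecutive refinements reaches $\delta$. The names `stepwise_refine`, `average_fuse`, and `cosine_similarity` are illustrative only.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_fuse(current: Sequence[float], neighbor: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the aggregation module: element-wise mean."""
    return [(c + n) / 2.0 for c, n in zip(current, neighbor)]


def stepwise_refine(
    current: Sequence[float],
    neighbors: Sequence[Sequence[float]],
    delta: float,
    fuse: Callable[[Sequence[float], Sequence[float]], Vector] = average_fuse,
) -> Tuple[Vector, int]:
    """Refine `current` with one neighbor per stage.

    Assumed stopping rule: stop as soon as the cosine similarity between the
    representation before and after a stage is at least `delta`.
    Returns the refined vector and the number of neighbors actually used.
    """
    refined = list(current)
    used = 0
    for neighbor in neighbors:
        candidate = fuse(refined, neighbor)
        used += 1
        converged = cosine_similarity(refined, candidate) >= delta
        refined = candidate
        if converged:
            break  # further stages would add little; skip the remaining frames
    return refined, used
```

For example, calling `stepwise_refine([1.0, 0.0], [[0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], delta=0.99)` stops after two of the four neighbors, because the second stage leaves the representation unchanged. Passing a different `fuse` callable or a different `delta` at call time mirrors, in a loose way, the reconfigurability the summary describes.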

Abstract

Vision-based perception is a key module for autonomous driving. Among visual perception tasks, video object detection is a primary yet challenging one because of feature degradation caused by fast motion or multiple poses. Current models usually aggregate features from neighboring frames to enhance the object representations, so that the task heads generate more accurate predictions. Although they achieve better performance, these methods rely on information from future frames and suffer from high computational complexity. Moreover, the aggregation process is not reconfigurable at inference time. These issues make most existing models infeasible for online applications. To solve these problems, we introduce a stepwise spatial global-local aggregation network. Our proposed models contain three main parts: (1) a multi-stage stepwise network gradually refines the predictions and object representations from the previous stage; (2) spatial global-local aggregation fuses local information from the neighboring frames with global semantics from the current frame to eliminate the feature degradation; (3) a dynamic aggregation strategy stops the aggregation process early based on the refinement results to remove redundancy and improve efficiency. Extensive experiments on the ImageNet VID benchmark validate the effectiveness and efficiency of our proposed models.

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Comparison of model frameworks between (a) existing feature-aggregation-based models, which fuse the features all at once, and (b) our proposed models, which gradually refine the results.
  • Figure 2: Framework of the multi-stage stepwise network. The bounding boxes in the current frame (marked as red boxes) are gradually refined according to the prediction results from the neighboring frames (marked as orange boxes) using the spatial global-local aggregation module.
  • Figure 3: Framework of spatial global-local aggregation.
  • Figure 4: Visualization examples of Deformable-DETR integrated with our proposed methods. The backbone is ResNet-50.