Scene Adaptive Sparse Transformer for Event-based Object Detection

Yansong Peng, Hebei Li, Yueyi Zhang, Xiaoyan Sun, Feng Wu

TL;DR

This work tackles the high computational cost of Transformer-based event detection by introducing SAST, a scene-adaptive sparse Transformer that performs window-token co-sparsification and dynamic sparsity optimization. It combines a scoring module, a selection module, and Masked Sparse Window Self-Attention to enable efficient, scene-aware processing of sparse event streams. Empirical results on 1Mpx and Gen1 show SAST achieves state-of-the-art mAP with significantly reduced A-FLOPs and runtime, outperforming both dense and prior sparse networks, with further gains from the SAST-CB variant. The approach offers practical benefits for real-time, energy-efficient event-based detection across varying scenes and resolutions.
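The idea of window-token co-sparsification with a scene-dependent sparsity level can be illustrated with a small sketch. This is not the authors' scoring or selection module: the function name `co_sparsify`, the parameters `token_threshold` and `window_ratio`, and the rule "keep a window when enough of its tokens score above a threshold" are all assumptions made for illustration. The point it shows is that threshold-based selection, unlike a fixed top-k, keeps fewer windows and tokens in a quiet scene and more in a busy one.

```python
from typing import List, Tuple


def co_sparsify(
    token_scores: List[List[float]],
    token_threshold: float = 0.5,   # hypothetical parameter
    window_ratio: float = 0.25,     # hypothetical parameter
) -> Tuple[List[int], List[List[int]]]:
    """Select windows first, then tokens inside the kept windows.

    token_scores[w][t] is an assumed importance score of token t in window w.
    A window is kept when at least `window_ratio` of its tokens score above
    `token_threshold`; inside a kept window only those tokens are kept.
    """
    kept_windows: List[int] = []
    kept_tokens: List[List[int]] = []
    for w, scores in enumerate(token_scores):
        important = [t for t, s in enumerate(scores) if s > token_threshold]
        if scores and len(important) / len(scores) >= window_ratio:
            kept_windows.append(w)
            kept_tokens.append(important)
    return kept_windows, kept_tokens


# Two made-up scenes with three windows of four tokens each.
quiet_scene = [[0.1, 0.2, 0.0, 0.1], [0.9, 0.8, 0.1, 0.2], [0.0, 0.1, 0.1, 0.1]]
busy_scene = [[0.9, 0.7, 0.6, 0.1], [0.9, 0.8, 0.1, 0.2], [0.0, 0.6, 0.7, 0.1]]

quiet_result = co_sparsify(quiet_scene)
busy_result = co_sparsify(busy_scene)
```

With the same thresholds, the quiet scene keeps one window and two tokens, while the busy scene keeps all three windows and seven tokens, so the amount of computation follows the scene content.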

Abstract

While recent Transformer-based approaches have shown impressive performances on event-based object detection tasks, their high computational costs still diminish the low power consumption advantage of event cameras. Image-based works attempt to reduce these costs by introducing sparse Transformers. However, they display inadequate sparsity and adaptability when applied to event-based object detection, since these approaches cannot balance the fine granularity of token-level sparsification and the efficiency of window-based Transformers, leading to reduced performance and efficiency. Furthermore, they lack scene-specific sparsity optimization, resulting in information loss and a lower recall rate. To overcome these limitations, we propose the Scene Adaptive Sparse Transformer (SAST). SAST enables window-token co-sparsification, significantly enhancing fault tolerance and reducing computational overhead. Leveraging the innovative scoring and selection modules, along with the Masked Sparse Window Self-Attention, SAST showcases remarkable scene-aware adaptability: It focuses only on important objects and dynamically optimizes sparsity level according to scene complexity, maintaining a remarkable balance between performance and computational cost. The evaluation results show that SAST outperforms all other dense and sparse networks in both performance and efficiency on two large-scale event-based object detection datasets (1Mpx and Gen1). Code: https://github.com/Peterande/SAST
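As a rough illustration of masking in window self-attention, the sketch below computes single-head scaled dot-product attention inside one window while excluding unselected tokens. It is a minimal sketch under stated assumptions, not the paper's Masked Sparse Window Self-Attention: the function name `masked_window_attention`, the boolean `keep` mask, and the choice to output zeros for unselected tokens are all hypothetical.

```python
import math
from typing import List


def masked_window_attention(
    q: List[List[float]],
    k: List[List[float]],
    v: List[List[float]],
    keep: List[bool],
) -> List[List[float]]:
    """Scaled dot-product attention over the tokens of one window.

    keep[j] is False for tokens that were not selected. Such tokens are never
    used as keys, and their own output rows are left as zeros (an assumption).
    """
    dim = len(q[0])
    out_dim = len(v[0])
    outputs: List[List[float]] = []
    for i in range(len(q)):
        if not keep[i]:
            outputs.append([0.0] * out_dim)
            continue
        # Unselected keys receive a logit of -inf, so their softmax weight is 0.
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            if keep[j] else -math.inf
            for j in range(len(k))
        ]
        peak = max(logits)  # finite, because keep[i] is True
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([
            sum(weights[j] / total * v[j][c] for j in range(len(v)))
            for c in range(out_dim)
        ])
    return outputs


# Toy window with three tokens; the third token is not selected.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
keep = [True, True, False]
attended = masked_window_attention(q, k, v, keep)
```

Because the third token is masked out, its large value vector has no influence on the outputs of the two selected tokens.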

Paper Structure

This paper contains 22 sections, 6 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Detection performance vs. computational cost on 1Mpx, with marker size indicating model size. SAST exhibits superiority by employing window-token co-sparsification and scene-specific sparsity optimization, maintaining low computation while delivering high performance across varying scenes.
  • Figure 1: Setting different hyper-parameters results in different sparsity levels and performance of SAST and SAST-CB on 1Mpx. Performance does not continuously improve with increasing sparsity levels. SAST and SAST-CB each have their advantages in sparser and denser settings.
  • Figure 2: (a) The hierarchical architecture of SAST. Four SAST blocks extract multi-scale features from sparse tokens transformed from events. (b) The architecture of an SAST block, which contains two successive SAST layers. (c) The architecture of an SAST layer. In an SAST layer, tokens are first partitioned into windows and scored by the scoring module. The selection module then selects important windows and tokens. The selected tokens within the selected windows are processed sequentially by MS-WSA, a sparse MLP layer, and an optional CB operation. Finally, the processed tokens are scattered back and the window partition is reversed. Norm layers are omitted for simplicity.
  • Figure 2: Additional visualizations of original events, score heatmaps, and selection results for four scenes in Gen1. As the network progresses through successive SAST blocks, each preceded by a downsampling stage, the scale (receptive field) of the tokens expands.
  • Figure 3: The architecture of the scoring and selection modules. (a) The scoring module scores each window and token. The scoring process is regulated based on event sparsity, with windows and tokens competing to limit selections. (b) The selection module uses two filters to select important windows and tokens sequentially based on their scores.
  • ...and 5 more figures
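
The layer flow described in the caption of the architecture figure (partition into windows, gather the selected tokens, process them, scatter them back, reverse the windows) can be sketched as follows. This is a simplified illustration with scalar tokens and hypothetical names (`partition_windows`, `reverse_windows`, `sparse_layer`); the callback `fn` is only a stand-in for the MS-WSA and sparse MLP steps, and the selection is passed in by hand instead of coming from a scoring module.

```python
from typing import Callable, Dict, List

Grid = List[List[float]]


def partition_windows(grid: Grid, win: int) -> List[List[float]]:
    """Split an H x W grid into non-overlapping win x win windows (row-major)."""
    height, width = len(grid), len(grid[0])
    assert height % win == 0 and width % win == 0
    return [
        [grid[r][c] for r in range(r0, r0 + win) for c in range(c0, c0 + win)]
        for r0 in range(0, height, win)
        for c0 in range(0, width, win)
    ]


def reverse_windows(windows: List[List[float]], height: int, width: int, win: int) -> Grid:
    """Inverse of partition_windows: rebuild the H x W grid."""
    grid = [[0.0] * width for _ in range(height)]
    index = 0
    for r0 in range(0, height, win):
        for c0 in range(0, width, win):
            for pos, value in enumerate(windows[index]):
                grid[r0 + pos // win][c0 + pos % win] = value
            index += 1
    return grid


def sparse_layer(
    grid: Grid,
    win: int,
    selected: Dict[int, List[int]],
    fn: Callable[[List[float]], List[float]],
) -> Grid:
    """Apply fn only to selected tokens; all other tokens pass through unchanged.

    selected maps a window index to the token positions kept in that window.
    """
    windows = partition_windows(grid, win)
    for w_idx, token_idx in selected.items():
        gathered = [windows[w_idx][t] for t in token_idx]  # gather
        updated = fn(gathered)                             # stand-in processing
        for t, value in zip(token_idx, updated):           # scatter back
            windows[w_idx][t] = value
    return reverse_windows(windows, len(grid), len(grid[0]), win)


# 4 x 4 grid holding the values 1..16, split into four 2 x 2 windows.
grid = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
selected = {0: [0, 3], 3: [1]}  # hand-picked selection for illustration
result = sparse_layer(grid, 2, selected, lambda xs: [x * 10 for x in xs])
```

Only the three selected tokens change; every other position keeps its original value, which is what makes the cost scale with the number of selected tokens in this sketch.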