
Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking

Pengcheng Shao, Tianyang Xu, Xuefeng Zhu, Xiaojun Wu, Josef Kittler

TL;DR

A dynamic event subframe splitting strategy divides the event stream into finer-grained event clusters to capture spatio-temporal features that contain motion cues. The resulting method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets.

Abstract

Event-based bionic cameras asynchronously capture dynamic scenes with high temporal resolution and high dynamic range, offering potential for integrating events and RGB under conditions of illumination degradation and fast motion. Existing RGB-E tracking methods model event characteristics using the attention mechanism of the Transformer before integrating the two modalities. However, these methods aggregate the event stream into a single event frame and therefore do not utilise the temporal information inherent in the event stream. Moreover, the traditional attention mechanism is well suited to dense semantic features, whereas the attention mechanism for sparse event features requires rethinking. In this paper, we propose a dynamic event subframe splitting strategy that splits the event stream into finer-grained event clusters, aiming to capture spatio-temporal features that contain motion cues. Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in the temporal and spatial dimensions. The experimental results indicate that our method outperforms existing state-of-the-art methods on the FE240 and COESOT datasets, providing an effective way of processing event data.

Paper Structure (15 sections, 8 equations, 8 figures, 4 tables)


Figures (8)

  • Figure 1: Different splitting methods for the event stream. Blue points indicate that a light intensity enhancement event occurred at that pixel at that moment. Red points indicate that a light intensity reduction event occurred at that pixel at that moment. (a) All events in the event stream are stacked together according to polarity to form a single event frame. (b) The whole event stream is divided into n smaller event streams, and the events in each of them are stacked together according to polarity to form multiple event subframes.
  • Figure 2: The overall architecture of DS-MESA. The event stream is discretised into n clusters along the temporal dimension, subsequently forming multiple event subframes. "STME" stands for spatio-temporal motion entanglement module. "MGF" stands for mutually guided fusion. The MGF employs a cross-attention mechanism within both the RGB and event modalities. It comprises two layers of Transformer encoders from the Vision Transformer (ViT). One layer is dedicated to fusing template features across the two modalities, while the other focuses on integrating search features across the two modalities. "Relation Modelling Block" consists of N layers of Transformer encoders from ViT.
  • Figure 3: Single event frame vs. multiple event subframes. (a) All events within a time interval are aggregated into a single event frame. Under drastic shifts in lighting conditions, the target is almost invisible in the single event frame. In contrast, in (b), (c) and (d), multiple event subframes clearly reveal the target and its motion trajectory.
  • Figure 4: The detailed architecture of the proposed STME. The red dashed boxes denote the sparsified event attention matrices. $e_{t-1}$ is derived from the features produced by the previous event frame from either stage 2 or stage 3. $e_t$ is derived from the features produced by the previous event frame from either stage 2 or stage 3.
  • Figure 5: The detailed process of the event-based sparse attention (ESA) operation. The obtained event attention matrix is sparsified four times, and the final sparse attention matrix is obtained by adding the matrices from the four sparsifications.
  • ...and 3 more figures
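
The contrast drawn in Figure 1 between a single event frame and multiple event subframes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes events arrive as rows of (x, y, t, p) and splits them into n equal-width time windows, whereas the paper's dynamic splitting rule is not described in this section. The function name `split_into_subframes` and its parameters are hypothetical.

```python
import numpy as np


def split_into_subframes(events, height, width, n):
    """Stack events into n polarity-separated subframes.

    events: array-like of shape (N, 4) with columns (x, y, t, p);
            p > 0 is treated as a brightness increase, otherwise a decrease.
    Returns an int32 array of shape (n, 2, height, width), where channel 0
    counts decrease events and channel 1 counts increase events.
    Assumption: equal-width time windows (not the paper's dynamic rule).
    """
    events = np.asarray(events, dtype=np.float64)
    frames = np.zeros((n, 2, height, width), dtype=np.int32)
    if len(events) == 0:
        return frames

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    pol = (events[:, 3] > 0).astype(np.int64)

    # Map each timestamp to a window index in [0, n - 1].
    t0, t1 = t.min(), t.max()
    span = max(t1 - t0, 1e-12)  # guard against identical timestamps
    idx = np.minimum(((t - t0) / span * n).astype(np.int64), n - 1)

    # Unbuffered accumulation so repeated (window, polarity, y, x) hits all count.
    np.add.at(frames, (idx, pol, y, x), 1)
    return frames


# Four toy events on a 2x3 sensor.
toy_events = [
    [0, 0, 0.0, 1],
    [1, 0, 0.1, -1],
    [2, 1, 0.9, 1],
    [2, 1, 1.0, 1],
]
single_frame = split_into_subframes(toy_events, height=2, width=3, n=1)
subframes = split_into_subframes(toy_events, height=2, width=3, n=2)
```

With n = 1 this reduces to case (a) in Figure 1, a single event frame. With n > 1 it corresponds to case (b), where summing the subframes over the first axis recovers the single frame while the subframes themselves keep the temporal ordering of the events.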