SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features

Song Wang, Zhu Wang, Can Li, Xiaojuan Qi, Hayden Kwok-Hay So

TL;DR

SpikeMOT tackles multi-object tracking with event cameras by pairing a spiking neural network (SNN) tracker, built as a Siamese position estimator, with an object detector that runs at frame rate. The tracker extracts sparse spatiotemporal features from event voxels using spike response model (SRM) neurons, which lets it follow object motion at high frequency, while a detector–tracker–matcher pipeline maintains object identities. The accompanying DSEC-MOT benchmark covers severe occlusions and long-term re-identification in real-world scenes and supports evaluation with the HOTA, IDF1, and CLEAR metrics. Experiments on DSEC-MOT and FE240hz report state-of-the-art tracking performance and robustness to background event noise.

Abstract

In comparison to conventional RGB cameras, the superior temporal resolution of event cameras allows them to capture rich information between frames, making them prime candidates for object tracking. Yet in practice, despite their theoretical advantages, the body of work on event-based multi-object tracking (MOT) remains in its infancy, especially in real-world settings where events from complex backgrounds and camera motion can easily obscure the true target motion. In this work, an event-based multi-object tracker, called SpikeMOT, is presented to address these challenges. SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects. The resulting spike train representations are used to track the object movement at high frequency, while a simultaneous object detector provides updated spatial information of these objects at an equivalent frame rate. To evaluate the effectiveness of SpikeMOT, we introduce DSEC-MOT, the first large-scale event-based MOT benchmark incorporating fine-grained annotations for objects experiencing severe occlusions, frequent trajectory intersections, and long-term re-identification in real-world contexts. Extensive experiments employing DSEC-MOT and another event-based dataset, named FE240hz, demonstrate SpikeMOT's capability to achieve high tracking accuracy amidst challenging real-world scenarios, advancing the state of the art in event-based multi-object tracking.
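
The two update rates described in the abstract (a high-frequency tracker plus a frame-rate detector) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the names `iou`, `associate`, `run_two_rate_tracking`, `motion_fn`, `detect_fn`, and `steps_per_frame` are hypothetical, the SNN tracker is replaced by an arbitrary motion callback, and the matcher is a simple greedy IoU association that stands in for whatever matching the authors actually use.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              min_iou: float = 0.3) -> Dict[int, int]:
    """Greedy IoU matching (an assumed stand-in matcher).

    Returns a mapping {track_id: detection_index}.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    return matches


def run_two_rate_tracking(tracks: Dict[int, Box],
                          motion_fn: Callable[[int, Box], Box],
                          detect_fn: Callable[[int], List[Box]],
                          n_steps: int,
                          steps_per_frame: int) -> Dict[int, Box]:
    """Update boxes every step; correct them with detections every frame."""
    tracks = dict(tracks)
    next_id = max(tracks, default=-1) + 1
    for step in range(1, n_steps + 1):
        # High-rate update: the tracker moves every box a little.
        tracks = {tid: motion_fn(tid, box) for tid, box in tracks.items()}
        if step % steps_per_frame == 0:
            # Frame-rate update: detections refresh matched tracklets.
            dets = detect_fn(step)
            matches = associate(tracks, dets)
            for tid, j in matches.items():
                tracks[tid] = dets[j]
            # Unmatched detections start new tracklets.
            matched = set(matches.values())
            for j, det in enumerate(dets):
                if j not in matched:
                    tracks[next_id] = det
                    next_id += 1
    return tracks
```

In this sketch a tracklet keeps its identity across a detector update as long as its tracked box still overlaps the matching detection, which is why the high-rate position updates between frames matter for association.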

Paper Structure (34 sections, 7 equations, 10 figures, 7 tables)

This paper contains 34 sections, 7 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Event camera output. (a)-(d) Objects are captured under high contrast and low light conditions. (e)-(f) The same scene captured by an RGB camera and a moving event camera. (g) Events surrounding the car after background events are removed.
  • Figure 2: Overview of SpikeMOT. (a) Tracklets are updated every $\tau$ period by associating the latest tracked coordinates with the detected objects; (b) Tracker updates object coordinates every $\delta$ period leveraging the high temporal resolution of event cameras.
  • Figure 3: Siamese position estimator. (a) The architecture of the proposed Siamese network, with the template and the search branches incorporating two-stage SNN models. (b) The fundamental unit of the SNN model, encompassing the synapses represented by CNNs and the neurons modelled by SRMs. (c) Spiking neuron model, wherein the ascent of internal membrane potential culminates in the discharge of a spike once a threshold is reached. The emitted spike is then convolved by the refractory kernel to update the refractory signal, which in turn suppresses the subsequent response signal.
  • Figure 4: An overview of the scenes in each sequence of DSEC-MOT, which include cluttered background and heavy occlusion. The 1st and the 3rd rows show the RGB images of each sequence, while the 2nd and the 4th rows demonstrate the corresponding event-based representations.
  • Figure 5: Qualitative comparison of SpikeMOT with state-of-the-art trackers under two scenarios. The left scenario (the first 4 columns) involves a busy intersection where the car in the middle is occluded by pedestrians, while the right scenario (the following 3 columns) features multiple persons moving together. The first row shows the ground-truth bounding boxes from the event domain overlaid on the image domain, and the subsequent rows exhibit the trackers’ performance. In the left scene of the second row, the GTR tracker lost Car 1, which swapped identity with Car 10. In the right scene, GTR correctly tracked Person 41, but lost track of Person 47. In the third row, SiamMOT successfully tracked Car 1 after it was occluded, but an incorrect identity was assigned to Person 121, leading to a swap with Person 135. The fourth row demonstrates that ByteTrack was unable to track Car 29, while accurately tracking Pedestrians 75 and 77 in the corresponding scene on the right. Similarly, in the fifth row, Trackformer could not track the designated vehicle, but it was successful in tracking the specified pedestrians in the corresponding scene on the right. Finally, in the sixth row, our SpikeMOT exhibited successful tracking of Car 1 that was occluded at the intersection, and accurately tracked the group of individuals moving together.
  • ...and 5 more figures
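
The spiking neuron behaviour described in the Figure 3(c) caption can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions, not the authors' model: the alpha-shaped response and refractory kernels, the time constants `tau_s` and `tau_r`, the `refractory_scale` factor, and the function names `alpha_kernel` and `srm_neuron` are all hypothetical choices, and the input is a single already-weighted channel rather than the output of a convolutional synapse.

```python
import math
from typing import List, Tuple


def alpha_kernel(length: int, tau: float, scale: float = 1.0) -> List[float]:
    """Alpha-shaped kernel scale * (t / tau) * exp(1 - t / tau), t = 0..length-1.

    The kernel shape is an assumption; the caption does not specify it.
    """
    return [scale * (t / tau) * math.exp(1.0 - t / tau) for t in range(length)]


def srm_neuron(weighted_input: List[float],
               threshold: float = 1.0,
               tau_s: float = 2.0,
               tau_r: float = 2.0,
               refractory_scale: float = 2.0,
               kernel_len: int = 20) -> Tuple[List[int], List[float]]:
    """Simulate one spike-response-style neuron on a 1-D input sequence.

    Returns (spikes, membrane_potential), both of the same length as the input.
    """
    response_kernel = alpha_kernel(kernel_len, tau_s)
    # Negative kernel: each emitted spike lowers the potential afterwards.
    refractory_kernel = alpha_kernel(kernel_len, tau_r,
                                     -refractory_scale * threshold)

    n = len(weighted_input)
    refractory = [0.0] * n
    spikes = [0] * n
    potential = [0.0] * n

    for t in range(n):
        # Response signal: input convolved with the response kernel.
        response = sum(response_kernel[k] * weighted_input[t - k]
                       for k in range(min(kernel_len, t + 1)))
        # Membrane potential: response plus accumulated refractory signal.
        potential[t] = response + refractory[t]
        if potential[t] >= threshold:
            spikes[t] = 1
            # Convolve the new spike with the refractory kernel so that it
            # suppresses the potential at later time steps.
            for k in range(1, kernel_len):
                if t + k < n:
                    refractory[t + k] += refractory_kernel[k]
    return spikes, potential
```

Setting `refractory_scale=0.0` disables the suppression, which makes it easy to compare how the refractory signal thins out the output spike train for the same input.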