
Revisiting Color-Event based Tracking: A Unified Network, Dataset, and Metric

Chuanming Tang, Xiao Wang, Ju Huang, Bo Jiang, Lin Zhu, Jianlin Zhang, Yaowei Wang, Yonghong Tian

TL;DR

This work tackles color-event object tracking by proposing CEUTrack, a unified single-stage Transformer backbone that directly processes cropped color and event voxels without multi-branch fusion. It voxelizes event streams, projects four input streams into tokens, and uses adapters within a ViT-based backbone to achieve high efficiency (over 75 FPS) and SOTA accuracy. To advance the field, the authors introduce COESOT, a large-scale color-event tracking dataset with 90 categories, 1,354 sequences, and 17 attributes, along with a novel BreakOut Capacity (BOC) metric that weights harder videos more heavily. Experiments show strong performance on COESOT, VisEvent, and FE108, and ablations, visualizations, and failure analyses support the method’s robustness and practical value. The contribution comprises a high-performance tracker, a dataset, and an evaluation toolkit intended to foster further research in color-event tracking.
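
The idea of a metric that weights harder videos more heavily can be illustrated with a small sketch. The exact BOC formula is not given in this summary, so the weighting below (one minus the mean baseline score on each video) is an assumption chosen only for illustration, and the function name `difficulty_weighted_score` is hypothetical rather than part of the COESOT toolkit.

```python
from typing import Sequence


def difficulty_weighted_score(
    tracker_scores: Sequence[float],
    baseline_scores: Sequence[Sequence[float]],
) -> float:
    """Average per-video scores, weighting videos where baselines do poorly.

    tracker_scores[i]  : score in [0, 1] of the evaluated tracker on video i.
    baseline_scores[i] : scores in [0, 1] of the baseline trackers on video i.

    ASSUMPTION: weight_i = 1 - mean(baseline_scores[i]). This is an
    illustrative choice, not the published BOC definition.
    """
    if not tracker_scores or len(tracker_scores) != len(baseline_scores):
        raise ValueError("need one non-empty baseline list per video")
    total = 0.0
    for own, base in zip(tracker_scores, baseline_scores):
        if not base:
            raise ValueError("each video needs at least one baseline score")
        weight = 1.0 - sum(base) / len(base)  # harder video -> larger weight
        total += weight * own
    return total / len(tracker_scores)


# Video 0 is easy for the baselines, video 1 is hard.
baselines = [[0.9, 0.7], [0.2, 0.4]]
score = difficulty_weighted_score([0.8, 0.5], baselines)
```

Under this assumed weighting, improving by the same amount on the hard video raises the score more than improving on the easy one, which is the behavior the summary attributes to BOC.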

Abstract

Combining color and event cameras (the latter also called Dynamic Vision Sensors, DVS) for robust object tracking has recently emerged as a research topic. Existing color-event tracking frameworks usually contain multiple scattered modules, including feature extraction, fusion, matching, and interactive learning, which may lead to low efficiency and high computational complexity. In this paper, we propose a single-stage backbone network for Color-Event Unified Tracking (CEUTrack), which achieves the above functions simultaneously. Given the event points and RGB frames, we first transform the points into voxels and crop the template and search regions for both modalities. Then, these regions are projected into tokens and fed in parallel into the unified Transformer backbone network. The output features are fed into a tracking head for target object localization. Our proposed CEUTrack is simple, effective, and efficient: it runs at over 75 FPS and achieves new SOTA performance. To better validate the effectiveness of our model and address the data deficiency of this task, we also propose a generic and large-scale benchmark dataset for color-event tracking, termed COESOT, which contains 90 categories and 1354 video sequences. Additionally, a new evaluation metric named BOC is proposed in our evaluation toolkit to evaluate the prominence with respect to the baseline methods. We hope the newly proposed method, dataset, and evaluation metric provide a better platform for color-event-based tracking. The dataset, toolkit, and source code will be released at https://github.com/Event-AHU/COESOT.
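
The input pipeline described in the abstract (event points to voxels, regions to tokens, all tokens fed to one backbone) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the CEUTrack implementation: the event layout `(x, y, t, polarity)`, the temporal binning rule, the stream ordering, and the helper names `voxelize`, `patch_tokens`, and `build_sequence` are all hypothetical, and the learned linear projection of each patch is omitted.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, float, float, int]  # (x, y, t, polarity in {-1, +1}), assumed layout
Grid = List[List[List[float]]]           # indexed as [channel][row][col]


def voxelize(events: Sequence[Event], width: int, height: int,
             t_start: float, t_end: float, bins: int) -> Grid:
    """Accumulate signed polarity into a (bins, height, width) grid."""
    grid = [[[0.0] * width for _ in range(height)] for _ in range(bins)]
    span = max(t_end - t_start, 1e-9)
    for x, y, t, p in events:
        xi, yi = int(x), int(y)
        if not (0 <= xi < width and 0 <= yi < height and t_start <= t <= t_end):
            continue  # drop events outside the crop or time window
        b = min(int((t - t_start) / span * bins), bins - 1)
        grid[b][yi][xi] += p
    return grid


def patch_tokens(image: Grid, patch: int) -> List[List[float]]:
    """Flatten non-overlapping patch x patch blocks into token vectors."""
    channels, h, w = len(image), len(image[0]), len(image[0][0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tokens.append([image[c][top + dy][left + dx]
                           for c in range(channels)
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def build_sequence(streams: Sequence[Grid], patch: int) -> List[List[float]]:
    """Concatenate the tokens of every stream into one sequence (order assumed)."""
    sequence: List[List[float]] = []
    for stream in streams:
        sequence.extend(patch_tokens(stream, patch))
    return sequence


events = [(0.0, 0.0, 0.0, 1), (1.0, 1.0, 0.5, -1), (1.0, 1.0, 0.99, -1), (5.0, 0.0, 0.2, 1)]
voxels = voxelize(events, width=2, height=2, t_start=0.0, t_end=1.0, bins=2)
# Four toy streams standing in for event/RGB template and search crops.
sequence = build_sequence([voxels, voxels, voxels, voxels], patch=2)
```

In a real tracker each flattened patch would pass through a learned projection before entering the Transformer. The point of the sketch is only that, once every stream is expressed as tokens, a single backbone can attend over all of them at once.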

Paper Structure

This paper contains 21 sections, 3 equations, 11 figures, and 9 tables.

Figures (11)

  • Figure 1: (Left) Comparison of different multi-modal tracking frameworks, including early fusion (EF), middle feature fusion (MF), and our proposed unified tracking framework. (Right) Comparison of existing color-event tracking datasets. The circle size is proportional to the total number of frames in the dataset. Best viewed in color.
  • Figure 2: Overview of our proposed Color-Event Unified Tracking framework, CEUTrack. It simplifies the multi-branch multi-modal tracking framework based on the idea of a one-branch backbone for all, which removes cumbersome modules such as multi-stream feature extraction, fusion and correlation, and multi-stage steps.
  • Figure 3: Details of the attributes, categories, and distribution of the COESOT dataset.
  • Figure 4: Some representative examples from our proposed COESOT test set.
  • Figure 5: Comparison of the BOC scores of baseline trackers on the COESOT dataset.
  • ...and 6 more figures