Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion

Hongze Sun, Rui Liu, Wuque Cai, Jun Wang, Yue Wang, Huajin Tang, Yan Cui, Dezhong Yao, Daqing Guo

TL;DR

This study proposes a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking, together with an enhanced transformer-based module that fuses multimodal features using attention mechanisms.

Abstract

Visual object tracking, which is primarily based on visible light image sequences, encounters numerous challenges in complicated scenarios, such as low-light conditions, high dynamic ranges, and background clutter. To address these challenges, combining the advantages of multiple visual modalities is a promising route to reliable object tracking. However, existing approaches usually integrate multimodal inputs through adaptive local feature interactions, which cannot leverage the full potential of the visual cues and thus result in insufficient feature modeling. In this study, we propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking. The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities and then uses a unified encoder to align the features across different domains. Moreover, we propose an enhanced transformer-based module to fuse multimodal features using attention mechanisms. With these methods, the MMHT model can effectively construct a multiscale and multidimensional visual feature space and achieve discriminative feature modeling. Extensive experiments demonstrate that the MMHT model performs competitively with other state-of-the-art methods. Overall, our results highlight the effectiveness of the MMHT model in addressing the challenges faced in visual object tracking tasks.
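
To make the idea of attention-based multimodal fusion concrete, the following is a minimal sketch of bidirectional cross-attention between two token sets. It is not the MMHT fusion module: the function names (`cross_attention`, `fuse`), the absence of learned query/key/value projections, the residual addition, and the token-axis concatenation are all simplifying assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends over all key/value tokens.

    Tokens are plain lists of floats; no learned projections are applied (assumption).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(frame_tokens, event_tokens):
    """Hypothetical fusion step: each modality queries the other, then a residual add.

    Returns the enhanced frame tokens followed by the enhanced event tokens.
    """
    frame_ctx = cross_attention(frame_tokens, event_tokens, event_tokens)
    event_ctx = cross_attention(event_tokens, frame_tokens, frame_tokens)
    frame_enh = [[a + b for a, b in zip(f, c)] for f, c in zip(frame_tokens, frame_ctx)]
    event_enh = [[a + b for a, b in zip(e, c)] for e, c in zip(event_tokens, event_ctx)]
    return frame_enh + event_enh


if __name__ == "__main__":
    frame = [[1.0, 0.0], [0.0, 1.0]]   # two toy frame-modality tokens
    event = [[0.5, 0.5]]               # one toy event-modality token
    for token in fuse(frame, event):
        print(token)
```

In this toy setup each frame token receives context from the event token and vice versa, so the fused sequence carries information from both modalities. A real tracker would add learned projections, multiple heads, and normalization layers.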

Paper Structure (28 sections, 28 equations, 7 figures, 9 tables, 3 algorithms)

Figures (7)

  • Figure 1: (a) Complementary characteristics of frame- and event-based images. Event-based cameras excel in challenging conditions, such as environments with high dynamic ranges and low light, while frame-based cameras enable the capture of rich detailed information. (b) Schematic of the imaging principle. Frame-based cameras synchronously record light intensity, while event-based cameras utilize ON/OFF spike trains to asynchronously reflect light intensity changes. Additionally, event-based cameras are compatible with higher dynamic ranges than those of frame-based cameras.
  • Figure 2: The overall framework of the proposed MMHT. Frame- and event-modality inputs are initially processed by hybrid backbones to extract discriminative features. These features are subsequently embedded as patch embeddings using the multimodal feature embedding module, enabling effective cross-modal visual cue alignment. The proposed transformer-based multimodal feature fusion blocks leverage diverse attention modules to enhance and seamlessly integrate cross-domain features. Ultimately, the multimodal feature decoder produces fusion-level inputs, which are employed by our heads to perform accurate object tracking.
  • Figure 3: The video length distribution across the datasets, with a histogram interval of 50 frames and an upper bound of 3000 frames in the statistics.
  • Figure 4: The precision and success curves yielded by trackers trained with different modalities.
  • Figure 5: Visualization of the results produced by trackers trained using diverse modalities. (a) Tracking results obtained from trackers trained with various modalities. The predicted bounding boxes generated by the trackers are visually compared with the ground truth bounding boxes of the input images obtained from two modalities. (b) Corresponding response maps of different trackers. The response intensity progresses from green to red, indicating an increasing response level.
  • ...and 2 more figures
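
The imaging principle described in the Figure 1(b) caption, where an event-based camera emits ON/OFF spikes in response to light intensity changes, can be illustrated for a single pixel. The sketch below assumes a commonly used idealized model in which an event fires whenever the log intensity moves by a fixed contrast threshold away from a reference level. The function name `generate_events`, the threshold value, and the use of log intensity are assumptions for illustration and are not taken from the paper.

```python
import math


def generate_events(intensities, threshold=0.5):
    """Simulate ON/OFF events for one pixel from a sequence of positive intensities.

    Returns a list of (sample_index, polarity) pairs, where polarity is +1 (ON)
    for a brightness increase and -1 (OFF) for a decrease. Idealized model:
    no noise, no refractory period, intensities must be strictly positive.
    """
    events = []
    reference = math.log(intensities[0])  # log intensity at the last emitted event
    for t, value in enumerate(intensities[1:], start=1):
        diff = math.log(value) - reference
        # A large change can emit several events at the same sample index.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((t, polarity))
            reference += polarity * threshold
            diff = math.log(value) - reference
    return events


if __name__ == "__main__":
    # Brightness doubles at index 2, then drops to a quarter of that at index 4.
    print(generate_events([1.0, 1.0, 2.0, 2.0, 0.5]))
```

A constant signal produces no events at all, which reflects why event streams are sparse for static scenes, while the frame-based camera keeps recording full intensity values at every time step.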