eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking

Yucheng Chen, Lin Wang

TL;DR

This letter proposes eMoE-Tracker, a Transformer-based event-guided tracking framework that achieves new SOTA performance under various conditions. It introduces a contrastive relation modeling (CRM) module that emphasizes target information through a contrastive learning strategy between the target template and search regions.

Abstract

The unique complementarity of frame-based and event cameras for high-frame-rate object tracking has recently inspired research on multi-modal fusion approaches. However, these methods fuse the two modalities directly and thus ignore environmental attributes such as motion blur, illumination variance, occlusion, and scale variation. Meanwhile, insufficient interaction between search and template features makes it difficult to distinguish target objects from the background. As a result, performance degrades, especially under challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes so as to dynamically learn attribute-specific features, and to strengthen the target information by improving the interaction between the target template and search regions. To this end, we first propose an environmental Mix-of-Experts (eMoE) module built upon two components: environmental Attributes Disentanglement, which learns attribute-specific features, and environmental Attributes Assembling, which dynamically assembles the attribute-specific features using learnable attribute scores. The eMoE module is a subtle router that prompt-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to emphasize target information by leveraging a contrastive learning strategy between the target template and search regions. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to prior art.
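
The disentangle-then-assemble idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bottleneck-adapter form of each expert, the mean-pooled softmax router that produces the attribute scores, and all names (`EMoESketch`, `attribute_scores`, `hidden`) are assumptions made for illustration only. The sketch only shows the general pattern of per-attribute experts whose outputs are weighted by learnable scores and added to the backbone output at the same layer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class EMoESketch:
    """Hypothetical eMoE-style block: one expert per environmental attribute."""

    def __init__(self, dim, num_experts=4, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: each expert is a small bottleneck adapter (down -> ReLU -> up).
        self.down = rng.normal(0.0, 0.02, (num_experts, dim, hidden))
        self.up = rng.normal(0.0, 0.02, (num_experts, hidden, dim))
        # Assumption: attribute scores come from a linear router on pooled tokens.
        self.router = rng.normal(0.0, 0.02, (dim, num_experts))

    def attribute_scores(self, tokens):
        """Return one score per expert; scores are positive and sum to 1."""
        pooled = tokens.mean(axis=0)            # (dim,)
        return softmax(pooled @ self.router)    # (num_experts,)

    def forward(self, tokens, backbone_out):
        # Disentanglement: every expert computes its own attribute-specific features.
        h = np.maximum(np.einsum("nd,kdh->knh", tokens, self.down), 0.0)
        expert_feats = np.einsum("knh,khd->knd", h, self.up)   # (K, N, dim)
        # Assembling: weight the expert features by the attribute scores.
        scores = self.attribute_scores(tokens)
        assembled = np.einsum("k,knd->nd", scores, expert_feats)
        # Add the assembled features to the backbone output of the same layer.
        return backbone_out + assembled, scores


# Usage with random stand-ins for concatenated RGB/event tokens.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 16))         # 6 tokens, 16 channels
backbone_out = rng.normal(size=(6, 16))   # stand-in for the frozen encoder output
enhanced, scores = EMoESketch(dim=16).forward(tokens, backbone_out)
```

In this sketch only the expert and router weights would be trained, which is one plausible reading of "prompt-tunes the frozen backbone"; the paper itself should be consulted for the actual design.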

Paper Structure

This paper contains 26 sections, 6 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: An illustration of the core idea of the environmental MoE (eMoE) module. This module acts as a subtle router to prompt-tune the frozen backbone encoder. The number of experts is determined by the attributes we decouple from the environmental conditions, and each expert is responsible for learning the attribute-specific features. All the learned features are assembled and added to the outputs of the backbone encoder at the corresponding layer for a robust tracking representation.
  • Figure 2: Overview of our proposed framework. The network takes as input the patch embeddings of RGB frames and stacked event frames. The concatenated patches of the two modalities are fed into the backbone model and eMoE. eMoE is inserted at the $l$-th layer of the ViT to generate feature tokens, which are combined with the tokens from the ViT encoder at the corresponding layer. $E^{l}$ is the ViT encoder at layer $l$. The CRM module takes the enhanced tokens and further improves the discriminability of the target object.
  • Figure 3: An illustration of our eMoE module, shown here with four expert branches. The RGB and event tokens are fed into eMoE, which decouples the challenging attributes and generates attribute-specific representations under the corresponding challenging conditions. It is also responsible for dynamically weighting and assembling all the attribute-specific features to form a more discriminative representation for tracking.
  • Figure 4: An illustration of our CRM module. The RGB and event tokens are first fused into search region feature tokens and target template feature tokens. We exploit a contrastive learning strategy to pull the target information in the search tokens toward the template feature tokens while pushing background information away from them. The final goal is to make the tracking features more discriminative and unambiguous.
  • Figure 5: Visualization of attention maps from the backbone network compared with our eMoE-Tracker. Four challenging conditions (illumination variance, motion blur, occlusion, and scale variance) are selected to illustrate the effectiveness of environmental attributes disentanglement and feature assembling.
  • ...and 7 more figures
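
The pull/push behavior described in the Figure 4 caption can be sketched with a generic InfoNCE-style loss. The exact CRM loss is not given in this summary, so the formulation below is an assumption, not the paper's definition: the function name `contrastive_relation_loss`, the mean-pooled template anchor, the boolean `target_mask` marking which search tokens lie on the target, and the `temperature` value are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_relation_loss(template, search, target_mask, temperature=0.1):
    """InfoNCE-style loss (assumed form, not the paper's exact loss).

    template:    (M, D) fused template tokens
    search:      (N, D) fused search-region tokens
    target_mask: (N,) boolean, True where a search token covers the target
    """
    target_mask = np.asarray(target_mask, dtype=bool)
    if not target_mask.any():
        raise ValueError("target_mask must mark at least one target token")
    anchor = l2_normalize(template.mean(axis=0))      # (D,) pooled template
    sims = l2_normalize(search) @ anchor / temperature  # (N,) scaled similarities
    sims = sims - sims.max()                          # stabilise the exponentials
    weights = np.exp(sims)
    # Target tokens are positives; every search token appears in the denominator,
    # so background tokens act as negatives that are pushed away from the template.
    return float(-np.log(weights[target_mask].sum() / weights.sum()))


# Usage: token 0 matches the template, token 1 is orthogonal background.
template = np.array([[1.0, 0.0, 0.0, 0.0]])
search = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
loss_aligned = contrastive_relation_loss(template, search, [True, False])
loss_misaligned = contrastive_relation_loss(template, search, [False, True])
```

Under these assumptions the loss is small when the tokens marked as target already resemble the template and large when the marked tokens resemble the background, which is the behavior a pull/push objective needs.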