TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler

TL;DR

This work tackles RGB-E object tracking by addressing how to extract robust event features from sparse event data and fuse them with RGB appearance. The authors introduce TENet, consisting of a pooling-based event backbone (Pooler) with Multi-Scale Pooling and a Mutually-Guided Fusion module for cross-modal interaction, followed by relation modelling and a tracking head. Empirical results on VisEvent and COESOT demonstrate state-of-the-art performance, including a 4.9 percentage-point PR and 5.2 percentage-point SR improvement on COESOT, and faster inference than competing RGB-E trackers. The findings validate modality-specific event processing and cross-modal attention as effective strategies for robust, real-time multimodal tracking in challenging illumination and motion conditions.

Abstract

There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera, which is particularly informative about scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models that have been optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data by using diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually-Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, VisEvent and COESOT; on COESOT, the precision and success rates are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.
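
To make the idea of pooling with diverse kernel sizes concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the channels are split into equal groups (four here, matching the four groups mentioned in the Figure 4 caption), that each group is average-pooled with its own odd kernel size at stride 1, and that the groups are then concatenated. The function names `avg_pool_same` and `multi_scale_pooling`, the kernel sizes (1, 3, 5, 7), and the choice of average pooling are all hypothetical.

```python
# Illustrative sketch only: group count, kernel sizes and average pooling
# are assumptions, not details taken from the paper.

def avg_pool_same(channel, k):
    """Average-pool one 2-D channel with an odd k x k window, stride 1.

    Windows are clipped at the border, so the output keeps the input size.
    """
    h, w = len(channel), len(channel[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            rows = range(max(0, y - r), min(h, y + r + 1))
            cols = range(max(0, x - r), min(w, x + r + 1))
            vals = [channel[i][j] for i in rows for j in cols]
            out[y][x] = sum(vals) / len(vals)
    return out


def multi_scale_pooling(features, kernel_sizes=(1, 3, 5, 7)):
    """Split channels into len(kernel_sizes) groups, pool each group with
    its own kernel size, and concatenate the pooled groups."""
    n_groups = len(kernel_sizes)
    if len(features) % n_groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    per_group = len(features) // n_groups
    pooled = []
    for g, k in enumerate(kernel_sizes):
        group = features[g * per_group:(g + 1) * per_group]
        pooled.extend(avg_pool_same(ch, k) for ch in group)
    return pooled


# A sparse 5x5 "event" map with one active pixel, repeated over 4 channels.
sparse = [[0.0] * 5 for _ in range(5)]
sparse[2][2] = 9.0
msp_out = multi_scale_pooling([sparse] * 4)
```

In this toy input, the group pooled with kernel size 1 leaves the single active pixel untouched, while larger kernels spread its response over a progressively wider neighbourhood. This is one way sparse event activations could be aggregated at several spatial scales.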

Paper Structure (15 sections, 6 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Comparisons of the proposed Pooler with CNN and Transformer methods. The CNN-based method uses parallel branches to extract the template and search features. The Transformer-based method utilises a unified attention block to perform relation modelling of the template and search tokens. Our pipeline takes into account the sparsity property of event images. "Event-only" and "RGB-only" signify that the input contains only one specific modality.
  • Figure 2: (a) Overview of our TENet. The 3 stages of the Event branch collectively constitute the designed event backbone "Pooler".
  • Figure 3: Details of the Multi-Scale Pooling module.
  • Figure 4: Visualisation of the output features of different groups in the Multi-Scale Pooling during Stage 2. Group $n$ denotes the $n$-th of the 4 divided groups, $n \in \{1,2,3,4\}$. (b) and (e) Visualisation of the aggregation features from Stage 2 and Stage 3. (c) and (f) Visualisation of the final output features from Stage 2 and Stage 3.
  • Figure 5: Details of the proposed Mutually-Guided Fusion (MGF) module. The input to MGF is (i) or (ii). (i) denotes the augmentation of the Event features through the inclusion of RGB-related features. (ii) denotes the augmentation of RGB features with the inclusion of the Event-related features.
  • ...and 4 more figures
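
The two input configurations in the Figure 5 caption, (i) Event features augmented with RGB-related features and (ii) RGB features augmented with Event-related features, can be illustrated with a small cross-attention sketch. This is an assumption-laden illustration rather than the MGF module itself. It uses plain scaled dot-product attention with no learned projections and a simple residual addition, and the names `cross_attend` and `mutually_guided_fusion` are hypothetical.

```python
import math

# Illustrative sketch only: the real MGF module may use learned projections,
# normalisation or gating that are not modelled here.

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over the context tokens (scaled dot
    product) and returns the attention-weighted sum of the context."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return attended


def mutually_guided_fusion(rgb, event):
    """Augment each modality with features gathered from the other."""
    rgb_related = cross_attend(event, rgb)    # (i) Event queries, RGB context
    event_related = cross_attend(rgb, event)  # (ii) RGB queries, Event context
    event_aug = [[e + r for e, r in zip(et, rt)]
                 for et, rt in zip(event, rgb_related)]
    rgb_aug = [[r + e for r, e in zip(rt, et)]
               for rt, et in zip(rgb, event_related)]
    return rgb_aug, event_aug


# Two tokens per modality, two feature dimensions; the event tokens are empty.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[0.0, 0.0], [0.0, 0.0]]
rgb_aug, event_aug = mutually_guided_fusion(rgb_tokens, event_tokens)
```

With all-zero event tokens, the event queries attend uniformly to the RGB tokens and receive their mean, while the RGB tokens gain nothing from the empty event context. The toy case shows how one modality could compensate when the other carries little signal.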