TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking
Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler
TL;DR
This work tackles RGB-E object tracking by addressing two questions: how to extract robust features from sparse event data, and how to fuse them with RGB appearance. The authors introduce TENet, which consists of a pooling-based event backbone (Pooler) with Multi-Scale Pooling and a Mutually-Guided Fusion module for cross-modal interaction, followed by relation modelling and a tracking head. Empirical results on VisEvent and COESOT demonstrate state-of-the-art performance, including improvements on COESOT of 4.9 percentage points in precision rate (PR) and 5.2 percentage points in success rate (SR), together with faster inference than competing RGB-E trackers. The findings validate modality-specific event processing and cross-modal attention as effective strategies for robust, real-time multimodal tracking in challenging illumination and motion conditions.
Abstract
There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera, which is particularly informative about scene motion. However, existing approaches to RGB-E tracking extract event features using traditional appearance models that have been optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data by using diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually-Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, VisEvent and COESOT; on COESOT, the precision and success rates are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.
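
To make the idea of pooling sparse event data with diverse kernel sizes concrete, the following is a minimal sketch in plain Python. It is not the authors' Pooler: the helper names `avg_pool2d` and `multi_scale_pool`, the kernel sizes (3, 5, 7), the use of average pooling with zero padding, and the fusion of scales by a simple mean are all assumptions made for illustration.

```python
def avg_pool2d(grid, k):
    """Stride-1 average pooling with zero padding; k must be odd.

    The output has the same height and width as the input.
    """
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    # Cells outside the frame count as zero (zero padding).
                    if 0 <= yy < h and 0 <= xx < w:
                        total += grid[yy][xx]
            out[y][x] = total / (k * k)
    return out


def multi_scale_pool(grid, kernel_sizes=(3, 5, 7)):
    """Pool the same frame at several kernel sizes and average the results.

    Averaging across scales is an assumption of this sketch, not a detail
    taken from the paper.
    """
    pooled = [avg_pool2d(grid, k) for k in kernel_sizes]
    h, w = len(grid), len(grid[0])
    return [
        [sum(p[y][x] for p in pooled) / len(pooled) for x in range(w)]
        for y in range(h)
    ]


# A sparse 7x7 event frame containing a single event at the centre.
frame = [[0.0] * 7 for _ in range(7)]
frame[3][3] = 1.0

fused = multi_scale_pool(frame)
```

In this toy frame, the 3x3 kernel spreads the single event over its immediate neighbourhood only, while the 7x7 kernel reaches the corners of the frame. The fused response is therefore strongest at the event location and decays with distance, which is one way a sparse input can be turned into a denser representation.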
