Visual Object Tracking across Diverse Data Modalities: A Review
Mengmeng Wang, Teli Ma, Shuo Xin, Xiaojun Hou, Jiazheng Xing, Guang Dai, Jingdong Wang, Yong Liu
TL;DR
The paper surveys Visual Object Tracking (VOT) across RGB, thermal infrared, and LiDAR modalities, and four multi-modal combinations, with a focus on deep learning methods. It classifies single-modal RGB tracking into four paradigms (Discriminative Correlation Filters, Siamese, Instance Classification/Detection, and One-stream Transformers) and covers TIR and LiDAR variants, as well as multi-modal trackers (RGB-Depth, RGB-Thermal, RGB-LiDAR, RGB-Language) and their fusion strategies. It provides benchmark comparisons, datasets, and an analysis of trends (notably the rise of Transformer-based unified schemas and cross-modal fusion) along with recommendations for future work, including data-efficient training and long-term/multi-task learning. The findings highlight the practical potential of multi-modal VOT for robustness and accuracy, driven by Transformer architectures and cross-modal representations, while underscoring data and computational challenges that remain to be addressed.
Abstract
Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. VOT can be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared, and point clouds. Moreover, since no single sensor can handle all dynamic and varying environments, multi-modal VOT has also been investigated. This paper presents a comprehensive survey of recent progress in both single-modal and multi-modal VOT, with an emphasis on deep learning methods. Specifically, we first review three mainstream types of single-modal VOT: RGB, thermal infrared, and point cloud tracking. In particular, we identify four widely used single-modal frameworks, abstracting their schemas and categorizing the existing methods that build on them. Then we summarize four kinds of multi-modal VOT: RGB-Depth, RGB-Thermal, RGB-LiDAR, and RGB-Language. Moreover, we present comparison results on numerous VOT benchmarks for the discussed modalities. Finally, we provide recommendations and observations intended to inspire the future development of this fast-growing field.
