Awesome Multi-modal Object Tracking

Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang

TL;DR

This survey addresses the lack of comprehensive coverage of multi-modal object tracking (MMOT) by outlining five main multi-modal tracking tasks that extend beyond RGB-only approaches. It summarizes datasets and mainstream algorithms, grouped by technical paradigm (self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models), and provides a continuously updated GitHub resource for community collaboration. The work aims to accelerate progress toward universal multi-modal tracking and foundation-model-scale tracking systems with broader modality support. By clarifying the taxonomy, datasets, and benchmarks, the paper supports cross-modal fusion research and practical deployment in challenging environments.

Abstract

Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language, and audio, to estimate the state of an arbitrary object in a video sequence. It is of great significance for many applications, such as autonomous driving and intelligent surveillance. In recent years, MMOT has received growing attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, and RGB+language). To leverage more modalities, some recent efforts have been made to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks have been established that simultaneously provide more than two modalities, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the latest progress in MMOT, we conduct a comprehensive investigation in this report. Specifically, we first divide existing MMOT tasks into five main categories, i.e., RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X), where X can be any modality, such as language, depth, and event. Then, we analyze and summarize each MMOT task, focusing on widely used datasets and mainstream tracking algorithms based on their technical paradigms (e.g., self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models). Finally, we maintain a continuously updated paper list for MMOT at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

Paper Structure (2 sections, 2 figures)

Figures (2)

  • Figure 1: Scope of MMOT.
  • Figure 2: Data samples of five main MMOT tasks: (a) RGBL tracking, (b) RGBE tracking, (c) RGBD tracking, (d) RGBT tracking, and (e) miscellaneous (RGB+X) tracking. The figures are borrowed from wang2021towards, zhu2024crsot, yan2021depthtrack, li2019rgb, zhang2022webuav, and zhu2024unimod1k, respectively.