Towards General Multimodal Visual Tracking

Andong Lu, Mai Wen, Jinhu Wang, Yuanzhi Guo, Chenglong Li, Jin Tang, Bin Luo

TL;DR

This work introduces QuadTrack600, the first large-scale quad-modal visual tracking benchmark spanning RGB, thermal infrared, event data, and language, accompanied by a structured evaluation protocol across 21 challenge attributes. To fuse four heterogeneous modalities, the authors propose QuadFusion, a Transformer-based tracker built on a ViT backbone and featuring a Multiscale Fusion Mamba that enables modal, region, and token-level interactions with linear complexity. Ablation studies and extensive experiments on QuadTrack600 and three bi-modal datasets (LasHeR, VisEvent, TNL2K) show that quad-modal fusion consistently outperforms unimodal and bi-modal baselines, supporting both the difficulty of the benchmark and the effectiveness of the architecture. The work provides a reference benchmark for general multimodal tracking and a fusion mechanism that can be extended with additional modalities in future work, aiming at robust tracking in diverse real-world scenarios.

Abstract

Existing multimodal tracking studies focus on bi-modal scenarios such as RGB-Thermal, RGB-Event, and RGB-Language. Although leveraging complementary cues from different sources yields promising tracking performance, tracking remains challenging in complex scenes because of the limitations of bi-modal scenarios. In this work, we introduce a general multimodal visual tracking task that fully exploits the advantages of four modalities, namely RGB, thermal infrared, event, and language, for robust tracking under challenging conditions. To provide a comprehensive evaluation platform for general multimodal visual tracking, we construct QuadTrack600, a large-scale, high-quality benchmark comprising 600 video sequences, totaling 384.7K high-resolution (640x480) frame groups. In each frame group, all four modalities are spatially aligned and meticulously annotated with bounding boxes, and 21 sequence-level challenge attributes are provided for detailed performance analysis. Although quad-modal data provide richer information, fusing four modalities raises two challenging issues: the differences in information quantity among modalities and the computational burden of processing four modalities. To handle these issues, we propose a novel approach called QuadFusion, which incorporates an efficient Multiscale Fusion Mamba with four different scanning scales to achieve sufficient interactions among the four modalities while overcoming the exponential computational burden. Extensive experiments on the QuadTrack600 dataset and three bi-modal tracking datasets, LasHeR, VisEvent, and TNL2K, validate the effectiveness of QuadFusion.
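
As a rough illustration of the benchmark layout described in the abstract, the sketch below models one sequence and its frame groups. It is not the dataset's actual file format: all class, field, and function names are hypothetical, the (x, y, w, h) box convention is an assumption, and attaching the language description at sequence level is an assumption as well.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Frame resolution stated in the abstract (width x height).
FRAME_WIDTH, FRAME_HEIGHT = 640, 480


@dataclass
class FrameGroup:
    """One time step: three spatially aligned visual frames plus one box.

    Hypothetical layout; paths and box format are assumptions.
    """
    rgb_path: str
    thermal_path: str
    event_path: str
    bbox: Tuple[float, float, float, float]  # assumed (x, y, w, h) in pixels


@dataclass
class QuadSequence:
    """One video sequence with its language description and attribute tags."""
    name: str
    language: str  # assumed to be given once per sequence
    attributes: List[str] = field(default_factory=list)  # e.g. "PO", "LI"
    frames: List[FrameGroup] = field(default_factory=list)

    def has_attribute(self, tag: str) -> bool:
        """Return True if the sequence carries the given challenge tag."""
        return tag in self.attributes


def bbox_in_frame(bbox: Tuple[float, float, float, float],
                  width: int = FRAME_WIDTH,
                  height: int = FRAME_HEIGHT) -> bool:
    """Check that an (x, y, w, h) box has positive size and lies in the frame."""
    x, y, w, h = bbox
    return w > 0 and h > 0 and x >= 0 and y >= 0 and x + w <= width and y + h <= height


def select_by_attribute(sequences: List[QuadSequence], tag: str) -> List[QuadSequence]:
    """Filter sequences for attribute-wise evaluation."""
    return [s for s in sequences if s.has_attribute(tag)]
```

Because the modalities are spatially aligned, a single box per frame group is assumed to serve all three visual streams. From the stated totals, a sequence holds roughly 641 frame groups on average (384.7K / 600).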

Paper Structure

This paper contains 27 sections, 5 equations, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: Representative samples from QuadTrack600. The challenge attributes of each sequence are listed above the data, including partial occlusion (PO), camera motion (CM), background clutter (BC), similar appearance (SA), low illumination (LI), illumination variation (IV), scale variation (SV), background object motion (BOM), no motion (NM), overexposure (OE), low resolution (LR), fast motion (FM), and thermal crossover (TC). Representative failure scenarios of existing multimodal visual trackers are shown in (b), (c), and (d), respectively.
  • Figure 2: Workflow of data collection and data alignment.
  • Figure 3: Analysis and statistics of QuadTrack600.
  • Figure 4: The overall architecture of QuadFusion. First, three visual modalities and language input are embedded as tokens and processed through Transformer blocks for joint feature extraction and relationship modeling between the search and template images. In the proposed Multiscale Fusion Mamba (MFM) block, tokens from all four modalities are concatenated along the token dimension, enabling efficient multi-scale interactions across four scanning levels. Finally, the fused search-region tokens are passed to the tracking head to generate the final tracking prediction.
  • Figure 5: Performance comparison of QuadFusion against advanced trackers under different challenging attributes of QuadTrack600.
  • ...and 3 more figures
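
The Figure 4 caption describes concatenating the tokens of all four modalities along the token dimension and then scanning them at several scales. The sketch below illustrates only that ordering idea, not the Mamba block itself or the authors' implementation. The function names, the `group` parameter, and the assumption that every modality contributes the same number of tokens are all hypothetical.

```python
from typing import Dict, List

# Modality order used for concatenation (an assumption for this sketch).
MODALITIES = ["rgb", "thermal", "event", "language"]


def concat_tokens(tokens: Dict[str, List[str]]) -> List[str]:
    """Concatenate per-modality token lists along the token dimension."""
    flat: List[str] = []
    for name in MODALITIES:
        flat.extend(tokens[name])
    return flat


def scan_order(num_modalities: int, num_tokens: int, group: int) -> List[int]:
    """Return a visiting order over the concatenated sequence.

    Tokens are visited in chunks of `group` consecutive tokens, cycling
    through the modalities before moving on to the next chunk.
      group == num_tokens -> one modality after another (modality-level scan)
      group == 1          -> modalities interleaved token by token
    Intermediate values give region-like scans. Hypothetical scheme.
    """
    if group <= 0 or num_tokens % group != 0:
        raise ValueError("group must be positive and divide num_tokens")
    order: List[int] = []
    for start in range(0, num_tokens, group):
        for m in range(num_modalities):
            base = m * num_tokens + start
            order.extend(range(base, base + group))
    return order


def apply_order(sequence: List[str], order: List[int]) -> List[str]:
    """Reorder the concatenated tokens according to a scan order."""
    return [sequence[i] for i in order]


if __name__ == "__main__":
    # Four toy tokens per modality, labelled by the modality's first letter.
    toy = {m: [f"{m[0]}{i}" for i in range(4)] for m in MODALITIES}
    flat = concat_tokens(toy)
    for group in (4, 2, 1):
        print(group, apply_order(flat, scan_order(len(MODALITIES), 4, group)))
```

Each scan order is a permutation of the same concatenated sequence, so a sequence model whose cost is linear in sequence length stays linear in the total token count at every scale. Only the neighbourhood in which tokens from different modalities meet changes.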