XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Eduard Zamfir, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte

TL;DR

The paper tackles the challenge of leveraging multimodal signals for video object tracking when data from all modalities are not available at inference. It introduces Mixture of Modal Experts (MeME), a soft router-based framework that enables cross-modal knowledge transfer by combining modality-specific and shared experts, supervised by expert-balance and modality-classification losses. Efficient reasoning is achieved through low-dimensional projections and an edge-prior guided shared path, complemented by Modal Prompting to condition RGB features on auxiliary modalities. Trained on paired RGB-X data (RGB-E, RGB-D, RGB-T) and evaluated on DepthTrack, LasHeR, VisEvent, and RGBT234, XTrack achieves state-of-the-art results and demonstrates robust cross-domain generalization, including zero-shot transfer to unseen modalities. The work establishes a practical path toward generalist multimodal trackers that improve single-modality inference through multimodal training.

Abstract

Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial that knowledge gained from multimodal sensing, such as identifying relevant features and regions, is effectively shared even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a "weak" classifier tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the modality of a given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close to and aligned with another. Technically, we achieve this by routing samples from one modality to the experts of the others, within a mixture-of-experts framework designed for multimodal video object tracking. During inference, the expert of the respective modality is chosen, which we show benefits from the multimodal knowledge available during training. Through exhaustive experiments that use only paired RGB-E, RGB-D, and RGB-T data during training, we showcase the benefit of the proposed method for RGB-X trackers during inference, with an average +3% precision improvement over the current SOTA. Our source code is publicly available at https://github.com/supertyd/XTrack/tree/main.
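
The routing idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the linear router, the toy scaling experts, and the names `soft_route`, `modality_classification_loss`, and `scale_by` are all assumptions made for illustration. It shows how a modality classifier that cannot separate modalities spreads its weight across experts, so a sample is processed by the experts of other modalities as well as its own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(features, router_rows, experts, shared_expert):
    """Mix modality experts with weights from a linear modality classifier.

    router_rows, experts, and shared_expert are hypothetical stand-ins.
    """
    # One logit per modality: a linear (deliberately weak) classifier.
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_rows]
    probs = softmax(logits)
    # The shared path is always active; modality experts are blended softly.
    mixed = list(shared_expert(features))
    for p, expert in zip(probs, experts):
        mixed = [m + p * o for m, o in zip(mixed, expert(features))]
    return mixed, probs


def modality_classification_loss(probs, modality_index):
    """Cross-entropy of the router's modality prediction."""
    return -math.log(probs[modality_index])


def scale_by(factor):
    """Toy expert: scales every feature by a constant."""
    return lambda feats: [factor * f for f in feats]


experts = [scale_by(1.0), scale_by(2.0), scale_by(3.0)]  # e.g. depth, event, thermal
undecided_router = [[0.0, 0.0]] * 3  # classifier cannot tell modalities apart
mixed, probs = soft_route([1.0, 2.0], undecided_router, experts, scale_by(1.0))
loss = modality_classification_loss(probs, modality_index=1)
```

With the all-zero router above, every expert receives equal weight, which is the "classifier fails" case the abstract describes as an opportunity for sharing. A sharper router would concentrate weight on a single modality's expert.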

Paper Structure (13 sections, 12 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Motivation: (a) Existing tracking methods typically address each modality in isolation, tackling one appearance-related challenge at a time. This is mainly due to cross-modal domain gaps and the lack of a comprehensive multimodal dataset. Consequently, only modality-specific branches are activated during inference based on a priori knowledge of the input type, limiting the potential for cross-modal integration. (b) We make this cross-modal "impossible triangle" possible by decomposing the knowledge from each modality into transferable attributes, each capturing distinct environmental aspects, which is achieved for the first time for an RGB-X tracker.
  • Figure 2: Joint Benefits: Despite domain gaps, some samples across modalities share similar attributes, creating overlap in representation, as shown on the left. This overlap complicates strict modality classification, introducing ambiguity. For example, on the Event benchmark, SOTA methods like ViPT [vipt] perform worse when trained on multiple modalities than on Events alone. In contrast, we view this ambiguity as a chance for cross-modal knowledge sharing. Our approach leverages this potential, enabling effective multimodal training and consistent improvement.
  • Figure 3: Mixture of Modal Experts (MeME): MoE has demonstrated significant advancements recently. Recent work [dai2024deepseekmoe] has further leveraged shared experts to reduce redundancy. However, it remains unclear which expert learns what, due to the implicit learning setting. In contrast, we make the learning protocol explicit by assigning specific experts to particular modal inputs. Additionally, we enhance the shared expert model with an inductive edge bias, increasing efficiency when dealing with relatively limited downstream data.
  • Figure 4: Details on experts and prompts.
  • Figure 5: Routing Choice. We present the top-k ($k=2$) decisions for activating the Depth, Event, and Thermal experts, using RGB-Event as input during inference. Our model dynamically selects the most suitable experts for different challenging scenarios, ensuring optimal object tracking performance despite appearance changes across diverse scenes.
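
The top-k ($k=2$) routing decision mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration only: the router scores are made up, the function name `top_k_experts` is hypothetical, and renormalizing the kept weights so they sum to one is an assumption, since the caption does not say how the selected experts are weighted.

```python
def top_k_experts(scores, k=2):
    """Keep the k highest-scoring experts and renormalize their weights.

    scores maps an expert name to a non-negative router score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(score for _, score in ranked)
    return {name: score / total for name, score in ranked}


# Hypothetical router scores for one RGB-Event input frame.
router_scores = {"depth": 0.2, "event": 0.5, "thermal": 0.3}
selected = top_k_experts(router_scores, k=2)
```

Here the Event and Thermal experts are selected and the Depth expert is skipped for this frame; a different scene could produce a different pair.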