Learning Progressive Adaptation for Multi-Modal Tracking

He Wang; Tianyang Xu; Zhangyong Tang; Xiao-Jun Wu; Josef Kittler

Abstract

Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adopting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook more advanced ways of adapting RGB pre-trained models and fail to modulate the individual modalities, the cross-modal interactions, and the prediction head. To address these issues, we propose Progressive Adaptation for Multi-Modal Tracking (PATrack). The approach incorporates modality-dependent, modality-entangled, and task-level adapters, bridging the gap between RGB pre-trained networks and multi-modal data through a progressive strategy. Specifically, the modality-dependent adapter enhances modality-specific information by decomposing the high- and low-frequency components, which yields a more robust feature representation within each modality. The modality-entangled adapter introduces inter-modal interactions through a cross-attention operation guided by inter-modal shared information, ensuring the reliability of the features conveyed between modalities. Additionally, because the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method achieves strong performance compared with state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
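The abstract mentions that the modality-dependent adapter decomposes high- and low-frequency components but does not say how. The sketch below only illustrates the general idea of such a split on a 1-D sequence. It assumes a moving-average low-pass filter with the residual taken as the high-frequency part. The function name `split_frequencies` and the `window` parameter are hypothetical, and the paper's actual adapter may use a different decomposition.

```python
def split_frequencies(signal, window=3):
    """Split a 1-D sequence into low- and high-frequency parts.

    Assumption for illustration only: the low-frequency part is a centred
    moving average (edges average over the neighbours that exist), and the
    high-frequency part is the residual, so low + high recovers the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    low = []
    for i in range(len(signal)):
        start = max(0, i - half)
        stop = min(len(signal), i + half + 1)
        neighbours = signal[start:stop]
        low.append(sum(neighbours) / len(neighbours))
    # The residual keeps whatever the smoothing removed (edges, fine detail).
    high = [x - l for x, l in zip(signal, low)]
    return low, high


if __name__ == "__main__":
    low, high = split_frequencies([0.0, 3.0, 0.0])
    print(low, high)
```

A flat input has no high-frequency residual under this split, while an isolated spike shows up mostly in the high-frequency part.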

Paper Structure (19 sections, 17 equations, 7 figures, 7 tables)

Introduction
Related Work
Visual Object Tracking
Multi-modal Tracking
Parameter-efficient Fine-tuning in Multi-Modal Tracking
Methodology
RGB Base Model
Multi-modal Tracking
Modality-Dependent Adaptation
Cross-modality Entangled Adaptation
Head Adaptation
Experiments
Experimental Setups
Datasets and Evaluation Metrics
Comparison with State-of-the-art Approaches
...and 4 more sections

Figures (7)

Figure 1: A comparison between the original fusion mechanisms and our proposed fusion mechanism: (a) the asymmetric structure, which relies on the dominant modality and employs the prompt fine-tuning paradigm; (b) the symmetric structure, which treats the dominant and auxiliary modalities as equal and emphasizes the complementary information across modalities; and (c) our proposed method (PATrack), which incorporates progressive adaptation learning at three levels: intra-modality, inter-modality, and task level. X represents thermal, event, or depth input.
Figure 2: Illustration of the proposed PATrack. The Modality Dependent Adaptation (MDA) module accepts only one modality at a time. The term “share” indicates that the RGB and X branches share parameters.
Figure 3: Attribute-based Precision Rate on the LasHeR dataset.
Figure 4: Exploratory analysis of the informativeness of the RGB, thermal, event, and depth modalities using single-modality information entropy, i.e. the entropy value calculated for each modality individually.
Figure 5: Comparison of feature maps between BAT and our fusion mechanism, and of score maps before and after Head Adaptation (HA). 'after MDA' and 'after CEA' denote the feature outputs of our proposed Modality Dependent Adapter and Cross-modality Entangled Adapter, respectively. (a) shows the visualization from the LasHeR dataset (RGB-T), (b) the visualization from the VisEvent dataset (RGB-E), and (c) the visualization from the DepthTrack dataset (RGB-D). In each before HA/after HA group, the top row shows the visualization without Head Adaptation and the bottom row shows the visualization after Head Adaptation.
...and 2 more figures
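
The Figure 4 caption refers to an entropy value computed per modality without giving the formula. One common reading is the Shannon entropy of a frame's intensity histogram, and the sketch below assumes that reading. The function name `shannon_entropy` and the flat list-of-intensities input are illustrative choices, not taken from the paper, whose exact computation may differ.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy, in bits, of the empirical intensity distribution.

    Assumption for illustration only: `pixels` is a flat sequence of
    discrete intensity values (for example 0..255) from one modality.
    """
    if not pixels:
        raise ValueError("pixels must be non-empty")
    counts = Counter(pixels)
    total = len(pixels)
    # H = -sum(p * log2(p)) over the intensities that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(shannon_entropy([0, 0, 255, 255]))
```

Under this definition a uniform frame has zero entropy, and a frame whose intensities are spread evenly over more values scores higher.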
