SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

Xiaojun Hou; Jiazheng Xing; Yijie Qian; Yaowei Guo; Shuo Xin; Junhao Chen; Kai Tang; Mengmeng Wang; Zhengkai Jiang; Liang Liu; Yong Liu

TL;DR

This work proposes SDSTrack, a novel symmetric multimodal tracking framework. It introduces lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner.

Abstract

Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at https://github.com/hoqolo/SDSTrack.

Paper Structure (17 sections, 12 equations, 8 figures, 8 tables)

Introduction
Related Works
Multimodal object tracking
Parameter-efficient Fine-tuning for Vision Models
Method
Symmetric Multimodal Adaptation (SMA)
Cross Modal Adaptation
Multimodal Fusion Adaptation
Complementary Masked Patch Distillation
Random Complementary Patch Mask (RCPM)
Self-Distillation Learning (SD)
Prediction Head and Supervised Loss
Experiments
Implementation Details
Comparison with State-of-the-arts
...and 2 more sections

Figures (8)

Figure 1: Previous frameworks vs. SDSTrack. (a) The previous symmetric framework [yan2021depthtrack] has many trainable parameters and a risk of overfitting. (b) The previous asymmetric framework [zhu2023ViPT] treats RGB as the primary modality and X-Modal as the auxiliary modality, using prompt tuning. (c) Our proposed SDSTrack uses adapter-based tuning to fine-tune the pre-trained RGB-based tracker in a symmetric manner. "X-Modal" denotes modalities other than RGB, which can be Depth, Thermal, Event, etc.
Figure 2: Input modality dependency comparison of multimodal object trackers. (a)-(b) Ground truth of RGB flow and X-modal flow. (c)-(e) Score maps of ViPT [zhu2023ViPT] under RGB drop, X drop, and multimodal random occlusion conditions. (f)-(h) Score maps of our SDSTrack under RGB drop, X drop, and multimodal random occlusion conditions.
Figure 3: The overall pipeline of our SDSTrack. First, we apply a random masking technique to the RGB-X patch embeddings in a complementary manner, ensuring that at least one modality remains valid. Next, the masked and clean data undergo four feature extraction and fusion stages. These stages have a symmetrical structure comprising ViT blocks and adapters. In each stage, the fused features from both the masked and clean paths undergo self-distillation, which improves the accuracy and robustness of the model. Finally, the RGB and X features are combined and forwarded to the head network to obtain the prediction results.
Figure 4: The components of symmetric multimodal adaptation (SMA). (a) The structure of the adapter. (b) Cross Modal Adaptation for X-modal image feature extraction. (c) Multimodal Fusion Adaptation for multimodal image feature fusion.
Figure 5: Overall performance on the LasHeR [li2021lasher] test set.
...and 3 more figures
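
The complementary masking step described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `mask_ratio` parameter, and the choice to fill masked patches with zeros are all assumptions made for illustration. The only property taken from the caption is that a patch is never masked in both modalities at once, so at least one modality remains valid at every patch position.

```python
import random


def complementary_patch_masks(num_patches, mask_ratio=0.5, seed=None):
    """Return (keep_rgb, keep_x): lists of booleans, True = patch kept.

    Hypothetical helper: each selected patch is masked in exactly one
    modality, so every patch index stays valid in RGB, in X, or in both.
    """
    rng = random.Random(seed)
    keep_rgb = [True] * num_patches
    keep_x = [True] * num_patches
    num_masked = int(num_patches * mask_ratio)  # assumed masking budget
    for idx in rng.sample(range(num_patches), num_masked):
        # Choose at random which single modality loses this patch.
        if rng.random() < 0.5:
            keep_rgb[idx] = False
        else:
            keep_x[idx] = False
    return keep_rgb, keep_x


def apply_mask(tokens, keep, fill=0.0):
    """Replace masked patch embeddings with a constant vector (assumption)."""
    return [tok if k else [fill] * len(tok) for tok, k in zip(tokens, keep)]
```

For example, `complementary_patch_masks(16, mask_ratio=0.5, seed=0)` yields two keep-lists in which eight patch positions in total are masked, and no position is masked in both. The masked and clean token sequences would then feed the two paths that the caption says are compared by self-distillation.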
