Optimizing Multi-Modality Trackers via Sensitivity-regularized Tuning

Zhiwen Chen, Jinjian Wu, Zhiyu Zhu, Yifan Zhang, Guangming Shi, Junhui Hou

TL;DR

This work addresses the misfitting challenges that arise when adapting RGB-pretrained trackers to multi-modal tracking tasks. It introduces sensitivity-regularized fine-tuning (SRFT), which leverages two intrinsic parameter sensitivities—prior sensitivity (via a Fisher Information Matrix tangent-space analysis) and transfer sensitivity (via gradient sparsity metrics)—to regulate gradient updates during cross-modal fine-tuning. A dynamic schedule controlled by $\kappa$ balances preserving pre-trained knowledge with adapting to new modalities, resulting in a low-rank, tangent-space-constrained optimization that improves transferability across RGB-Event, RGB-Depth, and RGB-Thermal benchmarks. Extensive experiments show SRFT consistently surpasses state-of-the-art methods on seven benchmarks and demonstrates compatibility with existing transfer-learning paradigms, highlighting its practical impact for robust, cross-modal visual tracking. The approach introduces a principled, data-informed way to navigate the plasticity-stability trade-off in cross-domain transfer, with potential applicability to other multi-modal perception tasks.

Abstract

This paper tackles the critical challenge of optimizing multi-modality trackers by effectively adapting models pre-trained on RGB data. Existing fine-tuning paradigms oscillate between excessive freedom and over-restriction, both leading to a suboptimal plasticity-stability trade-off. To mitigate this dilemma, we propose a novel sensitivity-regularized fine-tuning framework, which refines the learning process by incorporating intrinsic parameter sensitivities. Through a comprehensive investigation of the transition from pre-trained to multi-modal contexts, we identify that parameters sensitive to pivotal foundational patterns and cross-domain shifts are the primary drivers of this issue. Specifically, we first probe the tangent space of pre-trained weights to measure and orient prior sensitivities, dedicated to preserving generalization. Subsequently, we characterize transfer sensitivities during the tuning phase, emphasizing adaptability and stability. By incorporating these sensitivities as unified regularization terms, our method significantly enhances the transferability across modalities. Extensive experiments showcase the superior performance of our method, surpassing current state-of-the-art techniques across various multi-modality tracking benchmarks. The source code and models will be publicly available at https://github.com/zhiwen-xdu/SRTrack.
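To make the idea of regularizing updates with per-parameter sensitivities concrete, the following is a minimal toy sketch, not the authors' formulation. Everything in it is an assumption introduced for illustration: the function names, the linear `kappa_schedule`, the use of normalized gradient magnitude as a stand-in transfer score, and the `1 / (1 + score)` damping rule. It only shows the general pattern of shrinking updates for parameters with high sensitivity scores while a schedule shifts weight from the prior score to the transfer score.

```python
def normalize(values):
    """Scale non-negative scores to [0, 1] by the maximum (all zeros stay zero)."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]


def kappa_schedule(step, total_steps):
    """Hypothetical linear schedule: 1.0 at the first step, 0.0 at the last."""
    return 1.0 - step / max(total_steps - 1, 1)


def regularized_step(params, grads, prior_sens, step, total_steps, lr=0.1):
    """One SGD step with per-parameter gradient damping (illustrative only).

    prior_sens: scores computed once from the pre-trained model, for example
    diagonal Fisher values (an assumption, not taken from the paper).
    The transfer score here is simply the normalized gradient magnitude.
    """
    kappa = kappa_schedule(step, total_steps)
    prior = normalize(prior_sens)
    transfer = normalize([abs(g) for g in grads])
    new_params = []
    for p, g, s_prior, s_transfer in zip(params, grads, prior, transfer):
        # Blend the two scores; a higher blended score shrinks the update.
        score = kappa * s_prior + (1.0 - kappa) * s_transfer
        new_params.append(p - lr * g / (1.0 + score))
    return new_params


# Early in training the prior score dominates, so the first parameter
# (high prior sensitivity) moves less than the second.
early = regularized_step([1.0, 1.0], [1.0, 1.0], [4.0, 0.0], step=0, total_steps=10)
```

In this toy rule the schedule value plays the role the TL;DR attributes to $\kappa$: it trades off preserving pre-trained behavior against adapting to the new modality. The actual regularization terms are defined in the paper itself.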

Paper Structure

This paper contains 36 sections, 1 theorem, 26 equations, 13 figures, 16 tables, 1 algorithm.

Key Result

Proposition 1

Let $\tilde{\mathcal{F}}^{(\theta_0^j)} = V^{j} \Lambda^{j} (V^{j})^{T}$ be the eigen-decomposition of group $j$, with top-$K$ eigenvalues $\Lambda^{j}=\mathrm{diag}(\lambda^{j}_1,\dots,\lambda^{j}_K)$ and eigenvectors $V^{j}=[v^{j}_1,\dots,v^{j}_K]$. We construct the following approximation: where $\tilde{\mathcal{F}}^{(\theta_0)} = \mathrm{diag}(\tilde{\mathcal{F}}^{(\theta_0^1)},\dots,\t

Figures (13)

  • Figure 1: Optimization trajectory analysis on LasHeR. This plot contrasts the training and testing dynamics of different tuning paradigms. As visualized by the colored arrows, our method effectively mitigates the misfitting issue and enhances multi-modality trackers with superior generalization and stability.
  • Figure 2: Network architecture of the multi-modality trackers. All modules are initialized with the weights of a pre-trained RGB tracker. We only fine-tune backbones and fusion blocks with sensitivity-aware regularization terms.
  • Figure 3: Loss-parameter manifold schematic across different fine-tuning paradigms. FFT updates all weights without constraints, risking severe forgetting of pre-trained knowledge (deteriorating pre-trained loss, L1). PEFT restricts updates to additional parameters, which limits performance on new domains (stalled transferred loss, L2). In contrast, our SRFT performs optimization within an approximated pre-trained tangent space, indicating a better plasticity-stability trade-off.
  • Figure 4: Operation-wise prior sensitivity (i.e., eigen-decomposition FIM map) of the pre-trained OSTrack. High prior sensitivity indicates a tendency to deviate from the pre-trained tangent space, reflecting the disruption of pre-trained knowledge.
  • Figure 5: Instantaneous transfer sensitivity on the VisEvent dataset observed during fine-tuning. The high sparsity in the sensitivity map indicates that only a few gradients (light areas) dominate the updates.
  • ...and 8 more figures
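
The Figure 5 caption notes that only a few gradients dominate the updates. One simple way to quantify that kind of concentration is sketched below. The function name `top_fraction_mass` and the choice of metric are assumptions made for illustration; the paper's own gradient sparsity metric is not given on this page and may differ.

```python
def top_fraction_mass(grads, fraction=0.1):
    """Share of total |gradient| carried by the largest `fraction` of entries.

    Returns a value in [0, 1]. Values close to 1 mean a few entries dominate.
    """
    mags = sorted((abs(g) for g in grads), reverse=True)
    total = sum(mags)
    if total == 0:
        return 0.0
    k = max(1, int(len(mags) * fraction))  # keep at least one entry
    return sum(mags[:k]) / total


# A flat gradient spreads its mass evenly; a spiky one concentrates it.
flat = top_fraction_mass([1.0] * 10, fraction=0.1)
spiky = top_fraction_mass([10.0] + [0.1] * 9, fraction=0.1)
```

With ten equal entries the top 10% carries one tenth of the mass, whereas a single large entry among small ones pushes the score towards 1.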

Theorems & Definitions (2)

  • Proposition 1: Eigen-based Approximation Error Bound
  • Proof 1
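
As a numerical illustration of the construction named in Proposition 1 (a per-group top-$K$ eigen-decomposition assembled into a block-diagonal matrix), the sketch below uses NumPy on random stand-in data. The helper names, the block sizes, and the random "gradients" are all assumptions. The check at the end relies only on the standard linear-algebra fact that truncating a symmetric positive semi-definite matrix to its top-$K$ eigenpairs leaves a spectral-norm error equal to the $(K+1)$-th eigenvalue; it is not the specific bound stated in the paper, which is not reproduced on this page.

```python
import numpy as np


def top_k_approx(block, k):
    """Rank-k approximation V diag(lambda) V^T from the k largest eigenpairs."""
    eigvals, eigvecs = np.linalg.eigh(block)  # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    lam, vecs = eigvals[idx], eigvecs[:, idx]
    return (vecs * lam) @ vecs.T


def block_diag(blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    offset = 0
    for b in blocks:
        m = b.shape[0]
        out[offset:offset + m, offset:offset + m] = b
        offset += m
    return out


K = 2
rng = np.random.default_rng(0)
blocks = []
for dim in (4, 6):                    # two hypothetical parameter groups
    g = rng.normal(size=(32, dim))    # stand-in per-sample gradients
    blocks.append(g.T @ g / 32)       # Fisher-style PSD block

exact = block_diag(blocks)
approx = block_diag([top_k_approx(b, K) for b in blocks])

# Spectral-norm error equals the largest eigenvalue dropped in any block.
spectral_error = np.linalg.norm(exact - approx, ord=2)
largest_dropped = max(np.sort(np.linalg.eigvalsh(b))[::-1][K] for b in blocks)
```

Working group by group keeps each eigen-decomposition small, which is the practical appeal of a block-diagonal approximation over decomposing one matrix spanning all parameters.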