Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

Yabin Zhu, Jianqi Li, Chenglong Li, Jiaxiang Wang, Chengjie Gu, Jin Tang

Abstract

Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
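
The sparse/dense split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names (`Expert`, `SparseDenseMoE`), the bottleneck expert layout, the top-k softmax routing, the residual connection, and the averaging of shared experts are all assumptions chosen for illustration. It shows only the general idea that a router activates a few modality-specific experts per token, while the shared experts process every token.

```python
import math
import random


def silu(x):
    # Sigmoid Linear Unit: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical bottleneck expert: down-project, SiLU, up-project."""

    def __init__(self, dim, hidden, rng):
        self.down = rand_matrix(hidden, dim, rng)
        self.up = rand_matrix(dim, hidden, rng)

    def __call__(self, x):
        h = [silu(v) for v in matvec(self.down, x)]
        return matvec(self.up, h)


class SparseDenseMoE:
    """Illustrative adapter: sparse specific experts plus dense shared experts."""

    def __init__(self, dim, hidden, n_specific, n_shared, top_k, seed=0):
        rng = random.Random(seed)
        self.router = rand_matrix(n_specific, dim, rng)
        self.specific = [Expert(dim, hidden, rng) for _ in range(n_specific)]
        self.shared = [Expert(dim, hidden, rng) for _ in range(n_shared)]
        self.top_k = top_k

    def route(self, x):
        # Sparse path: keep only the top-k experts by router logit (assumed rule).
        logits = matvec(self.router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        return top, softmax([logits[i] for i in top])

    def __call__(self, x):
        top, weights = self.route(x)
        out = list(x)  # residual connection (assumption)
        for i, w in zip(top, weights):
            out = [o + w * v for o, v in zip(out, self.specific[i](x))]
        # Dense path: every shared expert sees every token, outputs averaged.
        for expert in self.shared:
            out = [o + v / len(self.shared) for o, v in zip(out, expert(x))]
        return out


moe = SparseDenseMoE(dim=8, hidden=4, n_specific=4, n_shared=2, top_k=2)
token = [0.1 * i for i in range(8)]
selected, gate = moe.route(token)
fused = moe(token)
```

Here `selected` holds the indices of the specific experts activated for this token and `gate` their normalized weights; the shared experts contribute regardless of the routing decision.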

Paper Structure (23 sections, 10 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Comparison with different mixture of experts structures.
  • Figure 2: Overall architecture of the proposed method. The Sparse-Dense Mixture of Experts (SDMoE) module is embedded as the multi-modal adapter to model the modality-specific and shared information of the fused modalities. The Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module is also embedded to model higher-order correlations in multi-level, multi-modal feature fusion, yielding more robust feature representations.
  • Figure 3: Detailed design of the proposed SDMoE module. SDMoE consists of a sparse MoE and a dense-shared MoE. The sparse MoE has a router and $N$ specific experts; the details of a specific expert are shown on the right of the figure, and the details of the dense-shared MoE are shown on the left. Specific-E and FFN-E denote a specific expert and the feedforward expert network of the dense-shared MoE, respectively. SiLU is the Sigmoid Linear Unit activation function.
  • Figure 4: Detailed design of the proposed GSAHF module. $\mathbf{LW}$ denotes learnable weights.
  • Figure 5: Visualization results on multi-modal tracking datasets.
  • ...and 1 more figures
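
The abstract and the Figure 4 caption mention Gram matrices for cross-modal semantic alignment but do not give the formulation. As a generic illustration only, the sketch below computes the Gram matrix of L2-normalized token features (pairwise cosine similarities) for two modalities and measures how far apart the two similarity structures are. The function names (`gram`, `gram_gap`), the normalization step, and the mean-squared-difference measure are assumptions, not the GSAHF module itself.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def gram(features):
    # Gram matrix: entry (i, j) is the inner product of feature i and feature j.
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def gram_gap(feats_a, feats_b):
    # Mean squared difference between the two modalities' Gram matrices
    # (a hypothetical alignment measure chosen for illustration).
    ga = gram([l2_normalize(f) for f in feats_a])
    gb = gram([l2_normalize(f) for f in feats_b])
    n = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy token features for two modalities (made-up values).
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tir_same = [[2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]   # same directions, different scale
tir_diff = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # all tokens collapse to one direction

gap_same = gram_gap(rgb, tir_same)
gap_diff = gram_gap(rgb, tir_diff)
```

Because the features are normalized first, `gap_same` is numerically zero even though the two modalities differ in magnitude, while `gap_diff` is clearly positive: the Gram matrix captures the relational structure among tokens rather than their raw values.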