DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter

Weihong Li, Shaohua Dong, Haonan Lu, Yanhao Zhang, Heng Fan, Libo Zhang

TL;DR

This paper explores adapter tuning and introduces a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack, which achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters.

Abstract

In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to DMTrack lies in two simple yet effective modules: a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA). The former, applied to each modality separately, adjusts the spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent bridges the gap between modalities and thus allows better cross-modality fusion. The latter facilitates cross-modality prompting progressively with two specially designed pixel-wise adapters, one shallow and one deep. The shallow adapter shares its parameters between the two modalities to bridge the information flow between the two modality branches, laying the foundation for the subsequent modality fusion. The deep adapter modulates the preliminarily fused information flow with pixel-wise intra-modal attention and then generates modality-aware prompts through pixel-wise inter-modal attention. With these designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks demonstrate that DMTrack achieves state-of-the-art results. Our code and models will be available at https://github.com/Nightwatch-Fox11/DMTrack.
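To give a feel for why adapter tuning keeps the trainable parameter count small, the following is a minimal sketch that counts the parameters of a generic bottleneck adapter (a down-projection followed by an up-projection) inserted into every block of a frozen backbone. The layer layout and all sizes (DIM, BLOCKS, BOTTLENECK) are assumptions chosen for illustration. They are not DMTrack's actual configuration, and the result is not meant to reproduce the 0.93M figure.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def bottleneck_adapter_params(dim: int, bottleneck: int) -> int:
    """Down-projection followed by up-projection (a common adapter layout)."""
    return linear_params(dim, bottleneck) + linear_params(bottleneck, dim)


# Hypothetical sizes for illustration only; not taken from the paper.
DIM = 768        # token width of a ViT-Base-style backbone (assumed)
BLOCKS = 12      # number of frozen transformer blocks (assumed)
BOTTLENECK = 8   # hidden width of each adapter (assumed)

per_block = bottleneck_adapter_params(DIM, BOTTLENECK)
total = BLOCKS * per_block

print(f"trainable parameters per block: {per_block}")
print(f"trainable parameters in total:  {total}")
```

Because the backbone weights stay frozen, only the adapter layers counted here would receive gradients, so the trainable budget grows with the bottleneck width rather than with the size of the backbone.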

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison of existing unified multimodal trackers and our proposed DMTrack in frameworks (a)-(b) and performance (c). Best viewed in color for all figures in this paper.
  • Figure 2: Overview of the proposed DMTrack. We first tokenize the template and search frames from each modality, then concatenate the resulting token sequences and process them through the frozen transformer architecture. Within each block structure, the STMA remains the only trainable component, specifically designed to produce self-prompts that encode intra-modal spatio-temporal relationships. The PMCA module bridges two processing branches through a twin-adapter architecture, where a shallow adapter and a deep adapter progressively synthesize inter-modal complementary prompts.
  • Figure 3: Detailed design of STMA. In STMA, the temporal context is extracted from Template Memory via a 1D convolutional layer.
  • Figure 4: Detailed design of the Shallow Adapter. Multimodal input flows are processed through three FC layers to generate foundational cross-modal complementary prompts, which are subsequently supplied to the other modality branch.
  • Figure 5: Detailed design of the Deep Adapter. In the deep adapter, we construct both Key and Value using dual modalities, enabling pixel-wise attention to simultaneously refine intra-modal representations while adaptively fusing cross-modal information.
  • ...and 2 more figures
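
The Figure 5 caption describes pixel-wise attention whose Key and Value are built from both modalities. One possible reading, sketched below, is that each token position attends only over the two modality features at that same position. This is an assumption made for illustration, not the authors' implementation: the function name pixelwise_dual_attention is hypothetical, and the learned query/key/value projections that a real adapter would have are omitted.

```python
import math


def pixelwise_dual_attention(query_tokens, tokens_a, tokens_b):
    """Fuse two modalities position by position.

    At each token position the query scores the features of modality A and
    modality B at that same position (they serve as both Key and Value here),
    and the output is their softmax-weighted sum.
    """
    fused = []
    for q, xa, xb in zip(query_tokens, tokens_a, tokens_b):
        scale = math.sqrt(len(q))
        # Scaled dot-product score of the query against each modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in (xa, xb)]
        # Numerically stable softmax over the two scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        w_a, w_b = exps[0] / norm, exps[1] / norm
        fused.append([w_a * a + w_b * b for a, b in zip(xa, xb)])
    return fused


# Toy input: two token positions with two channels each.
queries = [[1.0, 0.0], [0.0, 1.0]]
modality_a = [[1.0, 0.0], [2.0, 2.0]]
modality_b = [[0.0, 1.0], [2.0, 2.0]]

for row in pixelwise_dual_attention(queries, modality_a, modality_b):
    print(row)
```

In this toy input, the first position leans toward modality A because the query aligns with it, while the second position returns the shared feature unchanged because both modalities agree there.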