Middle Fusion and Multi-Stage, Multi-Form Prompts for Robust RGB-T Tracking

Qiming Wang, Yongqiang Bai, Hongxing Song

TL;DR

This work tackles the data-scarce and efficiency-constrained setting of RGB-T tracking by introducing M3PT, a parameter-efficient method built on a novel middle fusion meta-framework. It employs four visual prompt strategies—Uni-modal/Inter-modal Exploration, Middle Fusion, Fusion-modal Enhancement, and Modality-aware/Stage-aware prompts—to leverage upstream RGB trackers while modeling uni-modal, inter-modal, and fusion-modal patterns across two backbone stages. Empirical results across six challenging RGB-T benchmarks show that M3PT surpasses state-of-the-art prompt-fine-tuning methods and remains competitive with full fine-tuning, while tuning only 0.34M parameters. The approach advances practical, robust RGB-T tracking and highlights the potential of modality-aware prompts for multi-modal video understanding.

Abstract

RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet it remains hindered by two major challenges: 1) the trade-off between performance and efficiency; 2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion together with multi-modal, multi-stage visual prompts to overcome these challenges. We pioneer the use of an adjustable middle fusion meta-framework for RGB-T tracking, which helps the tracker balance performance against efficiency and thus meet varied application demands. Furthermore, building on this meta-framework, we use multiple flexible prompt strategies to adapt the pre-trained model to comprehensively explore uni-modal patterns and better model fusion-modal features in diverse modality-priority scenarios, harnessing the potential of prompt learning in RGB-T tracking. Evaluated on six existing challenging benchmarks, our method surpasses previous state-of-the-art prompt fine-tuning methods while remaining highly competitive with strong full-parameter fine-tuning methods, with only 0.34M fine-tuned parameters.

Paper Structure (26 sections, 11 equations, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Comparison of different fusion tracking frameworks. (a)-(c) are the three mainstream frameworks: image-level, feature-level, and decision-level multi-modal fusion tracking frameworks, respectively. (d) is our proposed middle fusion meta-framework. Our framework, unlike the other three, splits the backbone into dual-stream and single-stream structures for uni-modal and fusion-modal feature representation, respectively, with the fusion module situated between them.
  • Figure 2: The tracking pipeline of the RGB-based foundation model. The foundation model is a one-stream, one-stage RGB tracker built on a transformer backbone. The subscript V denotes the visible modality, and the superscript i denotes the layer number of the Transformer Encoder Block.
  • Figure 3: Pipeline of our M3PT. In this method, the images of the two modalities, which have the same size, pass through five steps to obtain the predicted bounding box: embedding, uni-modal and inter-modal exploration, middle fusion, fusion-modal enhancement, and state estimation. Here, the L transformer encoder blocks from the upstream model are divided into two groups, one for the uni-modal and inter-modal exploration stage and one for the fusion-modal enhancement stage. The subscripts and superscripts of the modules and symbols indicate the modality and the layer number, respectively.
  • Figure 4: The pipeline of our Uni-modal and Inter-modal Exploration Prompt Strategy and the overall architecture of its two lightweight prompters: the Uni-modal Exploration-assisted Prompter (UEP) and the Inter-modal Self-adaptive Prompter (IP). In this prompt strategy, the modality-independent information extracted by the UEP is first added to the output of the encoder block of the same modality as intra-modal prompts; the prompted template tokens are then used by the IP to generate effective inter-modal scenario prompts.
  • Figure 5: Overall structure of our designed Middle Fusion Prompter (MFP).
  • ...and 10 more figures
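
The middle fusion layout described in the captions of Figures 1 and 3 can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' implementation: `make_block`, `average_fusion`, and `middle_fusion_forward` are hypothetical names, the blocks are simple numeric functions standing in for frozen transformer encoder blocks, and element-wise averaging stands in for the learned Middle Fusion Prompter. It also assumes the same upstream blocks are applied to both modalities in the dual-stream stage. The sketch shows only how the split point divides the L blocks into a dual-stream group and a single-stream group.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def make_block(scale: float) -> Block:
    """Toy stand-in for one frozen encoder block (hypothetical, not a transformer)."""
    def block(tokens: Tokens) -> Tokens:
        return [scale * t + 1.0 for t in tokens]
    return block


def average_fusion(rgb: Tokens, tir: Tokens) -> Tokens:
    """Placeholder fusion: element-wise mean. The paper uses a learned prompter."""
    return [(a + b) / 2.0 for a, b in zip(rgb, tir)]


def middle_fusion_forward(
    blocks: Sequence[Block],
    rgb: Tokens,
    tir: Tokens,
    split: int,
    fuse: Callable[[Tokens, Tokens], Tokens] = average_fusion,
) -> Tuple[Tokens, int]:
    """Run blocks[:split] on each modality, fuse, then run blocks[split:] once.

    Returns the fused tokens and the number of block invocations, which
    equals 2 * split + (len(blocks) - split).
    """
    if not 0 <= split <= len(blocks):
        raise ValueError("split must lie between 0 and len(blocks)")
    calls = 0
    # Dual-stream stage: each modality passes through the first group of blocks.
    for block in blocks[:split]:
        rgb = block(rgb)
        tir = block(tir)
        calls += 2
    # Middle fusion: merge the two token streams into one.
    fused = fuse(rgb, tir)
    # Single-stream stage: the remaining blocks process the fused tokens only.
    for block in blocks[split:]:
        fused = block(fused)
        calls += 1
    return fused, calls


if __name__ == "__main__":
    blocks = [make_block(1.0) for _ in range(4)]
    for split in range(len(blocks) + 1):
        _, calls = middle_fusion_forward(blocks, [0.0, 2.0], [2.0, 4.0], split)
        print(f"split={split}: {calls} block invocations")
```

In this toy setting, moving the split point later adds one block invocation per step, which is one way to read the "adjustable" balance between performance and efficiency; the actual cost and accuracy of M3PT depend on details not modeled here.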