Learning Frequency and Memory-Aware Prompts for Multi-Modal Object Tracking

Boyue Xu, Ruichao Hou, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao

TL;DR

The work addresses robust multi-modal visual tracking by introducing a dual-adapter prompting framework that operates on a frozen RGB backbone. It combines a frequency-guided visual adapter for cross-modal fusion across spatial, channel, and frequency dimensions with a multilevel memory adapter to propagate reliable temporal context across long sequences. The approach yields state-of-the-art results on RGB-T, RGB-D, and RGB-E benchmarks with favorable parameter efficiency and runtime, demonstrating strong cross-modal interaction and temporal coherence without full fine-tuning. This enables robust tracking under occlusion, motion blur, and illumination changes, with practical impact for real-world multi-modal perception systems.

Abstract

Prompt-learning-based multi-modal trackers have made strong progress by using lightweight visual adapters to inject auxiliary-modality cues into frozen foundation models. However, they still underutilize two essentials: modality-specific frequency structure and long-range temporal dependencies. We present Learning Frequency and Memory-Aware Prompts, a dual-adapter framework that injects lightweight prompts into a frozen RGB tracker. A frequency-guided visual adapter adaptively transfers complementary cues across modalities by jointly calibrating spatial, channel, and frequency components, narrowing the modality gap without full fine-tuning. A multilevel memory adapter with short, long, and permanent memory stores, updates, and retrieves reliable temporal context, enabling consistent propagation across frames and robust recovery from occlusion, motion blur, and illumination changes. This unified design preserves the efficiency of prompt learning while strengthening cross-modal interaction and temporal coherence. Extensive experiments on RGB-Thermal, RGB-Depth, and RGB-Event benchmarks show consistent state-of-the-art results over fully fine-tuned and adapter-based baselines, together with favorable parameter efficiency and runtime. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
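The abstract describes a memory with short, long, and permanent levels that stores, updates, and retrieves temporal context. The sketch below is only an illustration of that idea, not the authors' implementation: the class name `MultiLevelMemory`, the confidence threshold, the capacity limits, and the softmax-weighted retrieval are all assumptions chosen for clarity, since this page does not state the actual update and retrieval rules.

```python
from collections import deque

import numpy as np


class MultiLevelMemory:
    """Hypothetical three-level memory of temporal cue vectors."""

    def __init__(self, init_cue, short_size=3, long_size=5, conf_thresh=0.7):
        # Permanent memory: the initial cue, never overwritten (assumption).
        self.permanent = np.asarray(init_cue, dtype=float)
        # Short memory: the most recent cues, oldest dropped first.
        self.short = deque(maxlen=short_size)
        # Long memory: (confidence, cue) pairs, gated by confidence.
        self.long = []
        self.long_size = long_size
        self.conf_thresh = conf_thresh

    def update(self, cue, confidence):
        cue = np.asarray(cue, dtype=float)
        self.short.append(cue)
        if confidence >= self.conf_thresh:
            self.long.append((confidence, cue))
            # Keep only the highest-confidence entries.
            self.long.sort(key=lambda item: item[0], reverse=True)
            del self.long[self.long_size:]

    def retrieve(self, query):
        # Stack every stored cue and return a similarity-weighted average.
        bank = np.stack(
            [self.permanent, *self.short, *(c for _, c in self.long)]
        )
        query = np.asarray(query, dtype=float)
        scores = bank @ query / np.sqrt(bank.shape[1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ bank
```

In this sketch, low-confidence frames (for example, during occlusion) still pass through the short memory but never enter the long memory, which is one simple way to keep unreliable cues from persisting across a long sequence.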

Paper Structure

This paper contains 21 sections, 17 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Framework comparisons between the existing prompt-learning-based tracker and our tracker. (a) Existing trackers propagate temporal cues from adjacent frames and fuse multi-modal features in channel and spatial dimensions. (b) The proposed method integrates a memory adapter to propagate cues adaptively and merge features in channel, spatial, and frequency dimensions.
  • Figure 2: Illustration of frequency-domain characteristics for RGB-T, RGB-D and RGB-E. The second to fourth rows show the magnitude map, the low-frequency visualization, and the high-frequency visualization, respectively. Red boxes indicate the most informative regions in the low-frequency domain, while green boxes highlight the most informative regions in the high-frequency domain.
  • Figure 3: The framework of the proposed method. We first transform the templates and search region of each modality into tokens, then concatenate them with temporal cue tokens and feed them into the $L$-layer ViT block. The visual adapter and memory adapter run in parallel with the ViT block. The memory adapter propagates valuable temporal cues across frames, and the visual adapter handles modality interaction and fusion. The output features are fed into the prediction head to produce the tracking results.
  • Figure 4: Detailed design of the frequency-guided multi-modal fusion module, which enhances the feature representation by combining spatial, channel, and frequency information from different modalities.
  • Figure 5: Detailed design of memory update and memory retrieval, which ensures the most reliable tracking cues are propagated in the subsequent sequence.
  • ...and 4 more figures
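
The three views named in the Figure 2 caption (magnitude map, low-frequency visualization, high-frequency visualization) can be approximated with a 2-D FFT and a circular mask. This is a minimal sketch under stated assumptions: the function name `split_frequency_bands` and the `radius_ratio` cutoff are hypothetical, and the page does not say which filter the authors used to produce the figure.

```python
import numpy as np


def split_frequency_bands(image, radius_ratio=0.1):
    """Return (magnitude, low, high) for a 2-D grayscale image.

    radius_ratio is an assumed cutoff: frequencies within
    radius_ratio * min(height, width) of the spectrum centre count as low.
    """
    # Centre the zero-frequency component so a circular mask is easy to build.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Log scaling makes the magnitude map viewable.
    magnitude = np.log1p(np.abs(spectrum))

    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius_ratio * min(h, w)

    # Invert each band separately; the two bands sum back to the input.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return magnitude, low, high
```

Because the two masks are complementary, `low + high` reconstructs the input image, so the split loses no information; only the choice of cutoff determines which structures land in which band.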