MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

Ruopeng Gao, Limin Wang

TL;DR

MeMOTR tackles long-term temporal modeling in multi-object tracking by introducing per-target long-term memory and a memory-attention mechanism within a DETR-based Transformer. A separated Detection Decoder and a Temporal Interaction Module with adaptive aggregation and memory-based trajectory interactions stabilize and distinguish track embeddings, substantially improving association metrics. Across DanceTrack, MOT17, and BDD100K, MeMOTR achieves state-of-the-art or competitive results, with notable gains in AssA and IDF1, and demonstrates strong generalization to multi-class tracking. The work underscores the practical value of long-term temporal information for robust object association in challenging motion and occlusion scenarios.

Abstract

As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.
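To make the idea of a per-target long-term memory concrete, the following is a minimal sketch and not the authors' implementation. It assumes the memory is kept as an exponential moving average of a target's track embeddings; the update rule, the `rate` parameter, and the names `update_memory` and `TrackMemoryBank` are all hypothetical and do not come from this summary. The released code at the URL above is the authoritative reference.

```python
from typing import Dict, List

Vector = List[float]


def update_memory(memory: Vector, embedding: Vector, rate: float = 0.1) -> Vector:
    """Blend the newest track embedding into the long-term memory.

    Assumption: an exponential moving average with a fixed rate.
    """
    if len(memory) != len(embedding):
        raise ValueError("memory and embedding must have the same dimension")
    return [(1.0 - rate) * m + rate * e for m, e in zip(memory, embedding)]


class TrackMemoryBank:
    """Hypothetical per-target memory store keyed by track ID."""

    def __init__(self, rate: float = 0.1) -> None:
        self.rate = rate
        self.memories: Dict[int, Vector] = {}

    def observe(self, track_id: int, embedding: Vector) -> Vector:
        if track_id not in self.memories:
            # Newborn target: initialize the memory from its first embedding.
            self.memories[track_id] = list(embedding)
        else:
            # Existing target: fold the new embedding into its memory.
            self.memories[track_id] = update_memory(
                self.memories[track_id], embedding, self.rate
            )
        return self.memories[track_id]


# Usage: one target (ID 7) whose embedding changes abruptly between frames.
bank = TrackMemoryBank(rate=0.5)
bank.observe(7, [1.0, 0.0])
smoothed = bank.observe(7, [0.0, 1.0])
```

With a small `rate`, the memory changes slowly even when the per-frame embedding jumps, which is the intuition behind using it to stabilize a target's representation across many frames.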

Paper Structure (21 sections, 3 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Overview of MeMOTR. Like most DETR-based methods, we exploit a ResNet-50 backbone and a Transformer Encoder to learn a 2D representation of an input image. We use different colors to indicate different tracked targets, and the learnable detect query $Q_{det}$ is illustrated in gray. Then the Detection Decoder $\mathcal{D}_{det}$ processes the detect query to generate the detect embedding $E_{det}^t$, which aligns with the track embedding $E_{tck}^t$ from previous frames. Long-term memory is denoted as $M_{tck}^t$. The initialization process, shown by the blue dotted arrow, is applied to newborn objects. Our Long-Term Memory and Temporal Interaction Module are discussed in Sections \ref{Section:Long-Term-Memory} and \ref{Section:TIM}. More details are illustrated in Figure \ref{Fig:Temporal-Interaction-Module}.
  • Figure 2: Illustration of the Temporal Interaction Module. $\widetilde{E}_{tck}^{t+1}$ and $\widetilde{M}_{tck}^{t+1}$ are the predictions of $E_{tck}^t$ and $M_{tck}^t$ for the next frame, respectively.
  • Figure 3: Visualization of the anchors of tracked and newborn targets before (left) and after (right) the separated detection decoder $\mathcal{D}_{det}$.
  • Figure 4: Visualization of track embedding $E_{tck}^t$ (the first $50$ frames of sequence dancetrack0063) from different structure designs, using t-Distributed Stochastic Neighbor Embedding (t-SNE). Track embeddings for different tracked targets (IDs) are marked in different colors and shapes. Our design \ref{Fig:Track-Embedding:memory+attention} helps the model learn a more stable and distinguishable representation for the track embedding. Corresponding tracking performance is shown in Table \ref{Table:Temporal-Interaction}.
  • Figure 5: Visualization of track embedding $E_{tck}^t$ from the first 50 frames of the dancetrack0025 (upper) and dancetrack0034 (lower) sequences. Track embeddings for different tracked targets (IDs) are marked in different colors and shapes. The visualizations of our method are shown in Figures \ref{0025:end} and \ref{0034:end}.
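
Figures 4 and 5 judge track embeddings visually as "stable" (the same ID stays close together) and "distinguishable" (different IDs stay apart). A rough numerical counterpart is sketched below. It is an illustrative proxy only, not a metric used in the paper. It assumes cosine similarity as the closeness measure, and the function name `stability_and_separation` is hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean(values: List[float]) -> float:
    """Arithmetic mean; NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def stability_and_separation(tracks: Dict[int, List[Vector]]) -> Tuple[float, float]:
    """Return (mean same-ID similarity, mean cross-ID similarity).

    `tracks` maps a track ID to that target's embeddings over time.
    A high first value suggests stable embeddings; a low second value
    suggests that different targets are easy to tell apart.
    """
    labelled = [(tid, emb) for tid, embs in tracks.items() for emb in embs]
    same_id: List[float] = []
    cross_id: List[float] = []
    for (id_a, emb_a), (id_b, emb_b) in combinations(labelled, 2):
        target = same_id if id_a == id_b else cross_id
        target.append(cosine(emb_a, emb_b))
    return mean(same_id), mean(cross_id)


# Usage: two targets with perfectly stable, orthogonal embeddings.
tracks = {
    1: [[1.0, 0.0], [1.0, 0.0]],
    2: [[0.0, 1.0], [0.0, 1.0]],
}
same, cross = stability_and_separation(tracks)
```

In this toy case the same-ID similarity is at its maximum and the cross-ID similarity is zero, which corresponds to tight, well-separated clusters in a t-SNE plot.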