MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
Ruopeng Gao, Limin Wang
TL;DR
MeMOTR tackles long-term temporal modeling in multi-object tracking by introducing per-target long-term memory and a memory-attention mechanism within a DETR-based Transformer. A separated Detection Decoder and a Temporal Interaction Module with adaptive aggregation and memory-based trajectory interactions stabilize and distinguish track embeddings, substantially improving association metrics. Across DanceTrack, MOT17, and BDD100K, MeMOTR achieves state-of-the-art or competitive results, with notable gains in AssA and IDF1, and demonstrates strong generalization to multi-class tracking. The work underscores the practical value of long-term temporal information for robust object association in challenging motion and occlusion scenarios.
Abstract
As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.
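To make the idea of a per-target long-term memory concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the memory is maintained; this sketch assumes an exponential moving average over each target's track embedding, which is one common way to keep a slowly changing, stable representation. The function name `update_long_term_memory`, the dictionary layout, and the `momentum` value are all hypothetical.

```python
from typing import Dict, List


def update_long_term_memory(
    memory: Dict[int, List[float]],
    track_embeddings: Dict[int, List[float]],
    momentum: float = 0.01,
) -> Dict[int, List[float]]:
    """Blend each target's current embedding into its long-term memory.

    Assumption (not taken from the abstract): the memory is an exponential
    moving average, so a small momentum makes it change slowly over frames.
    """
    updated = dict(memory)
    for track_id, embedding in track_embeddings.items():
        if track_id not in updated:
            # First time this target is seen: initialize memory from its embedding.
            updated[track_id] = list(embedding)
        else:
            old = updated[track_id]
            updated[track_id] = [
                (1.0 - momentum) * m + momentum * e
                for m, e in zip(old, embedding)
            ]
    return updated


# Usage: one target tracked over two frames (toy 2-D embeddings).
memory: Dict[int, List[float]] = {}
memory = update_long_term_memory(memory, {1: [1.0, 0.0]})
memory = update_long_term_memory(memory, {1: [0.0, 1.0]}, momentum=0.5)
```

With a small momentum, a single noisy or occluded frame barely shifts the stored memory, which is the intuition behind using long-term information to keep track embeddings stable. Targets absent from the current frame keep their memory unchanged in this sketch.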
