Reading Relevant Feature from Global Representation Memory for Visual Object Tracking

Xinyu Zhou; Pinxue Guo; Lingyi Hong; Jinglun Li; Wei Zhang; Weifeng Ge; Wenqiang Zhang

TL;DR

The paper tackles the problem that visual object tracking benefits from historical reference information but can be hampered by redundant or harmful data when all past features are used. It introduces Reading Relevant Feature from Global Representation Memory (RFGM), a framework that builds a Global Representation Memory (GR Memory) at the token level and uses a relevance attention mechanism to selectively read the most relevant historical tokens for each frame. Through an adaptive token ranking module and differentiable memory updates, the method stores representative target features across the video and updates them with incoming templates, achieving robust tracking. Experiments on multiple benchmarks demonstrate competitive accuracy and a high inference speed of 71 FPS, illustrating the practical impact of adaptive memory reading for real-time tracking.

Abstract

Reference features from a template or historical frames are crucial for visual object tracking. Prior works use all features from a fixed template or memory. However, because videos are dynamic, the historical reference information that a search region needs varies across time steps. Using all features in the template and memory can therefore introduce redundancy and impair tracking performance. To alleviate this issue, we propose a novel tracking paradigm, consisting of a relevance attention mechanism and a global representation memory, which adaptively helps the search region select the most relevant historical information from the reference features. Specifically, unlike previous approaches, the proposed relevance attention mechanism dynamically chooses and builds the optimal global representation memory for the current frame by accessing cross-frame information globally. Moreover, it flexibly reads the relevant historical information from the constructed memory to reduce redundancy and counteract the negative effects of harmful information. Extensive experiments validate the effectiveness of the proposed method, which achieves competitive performance on five challenging datasets at 71 FPS.
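
To make the idea of reading only the relevant reference tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes that relevance is measured by scaled dot-product attention from search-region tokens to memory tokens, averaged over the search tokens, and that a fixed number of top-ranked memory tokens is kept. The function name `read_relevant_tokens`, the `keep` parameter, and the toy vectors are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_relevant_tokens(search_tokens, memory_tokens, keep):
    """Rank memory tokens by mean attention received from the search tokens.

    Hypothetical helper: returns the sorted indices of the `keep` most
    relevant memory tokens and the per-token relevance scores.
    """
    scale = 1.0 / math.sqrt(len(memory_tokens[0]))
    relevance = [0.0] * len(memory_tokens)
    for q in search_tokens:
        # Attention weights of one search token over all memory tokens.
        weights = softmax([dot(q, m) * scale for m in memory_tokens])
        for i, w in enumerate(weights):
            relevance[i] += w / len(search_tokens)
    ranked = sorted(range(len(memory_tokens)),
                    key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:keep]), relevance


# Toy example: two search tokens, four memory tokens of dimension 2.
search = [[1.0, 0.0], [0.9, 0.1]]
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.8, 0.2]]
kept, scores = read_relevant_tokens(search, memory, keep=2)
print(kept)  # prints [0, 3]
```

In this toy case the two memory tokens that point in roughly the same direction as the search tokens are kept, and the orthogonal and opposite tokens are discarded. Only the kept tokens would then be passed on to the rest of the attention computation.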

Paper Structure (14 sections, 16 equations, 5 figures, 7 tables)

This paper contains 14 sections, 16 equations, 5 figures, 7 tables.

Introduction
Related work
Visual Object Tracking Paradigms
Attention mechanism
Memory networks
Method
Tracking with GR memory and relevance attention
Adaptive Token Ranking
Loss Function
Experiments
Implementation Details
State-of-the-Art Comparisons
Ablation study
Conclusion and Limitation

Figures (5)

Figure 1: Three different methods of tracking pipeline. The purple dots represent the selected points from the new template that are updated into the memory, while the red dots and blue dots represent the selected points that are fed into the relation model.
Figure 2: The framework of RFGM. It consists of a GR memory, a token filter (TF), an encoder, and a decoder. The encoder is composed of attention and relevance attention, while the decoder consists of a prediction head.
Figure 3: The token filter consists of three regular transformer blocks and an adaptive token rank, which effectively updates the features in memory. GR-M represents the global representation memory, T stands for the new template, and S represents the search region.
Figure 4: Visualization of relevance attention. Taking one template as an example, white areas represent discarded regions, while the remaining areas represent the regions selected by relevance attention. Stages 1 to 3 indicate the progressive application of three relevance attention layers.
Figure 5: Visualization of GR memory updates. Over time t, the GR memory accumulates an increasing number of representative penguin features. In the second and third rows, white tokens represent discards, while the other tokens are retained in memory.
