ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking

Yushan Han, Kaer Huang

TL;DR

This paper targets the cost of training trackers from scratch or fine-tuning large pre-trained models, aiming to reduce that cost while keeping tracking performance strong. It proposes ACTrack, which freezes a pre-trained Transformer backbone and adds a lightweight additive net to model spatio-temporal relations, using a Siamese convolutional design and temporal sequence modeling to predict object coordinates frame-by-frame. The method preserves global context while attending to local features, delivering state-of-the-art results on multiple benchmarks. It also reduces training time and memory usage without sacrificing accuracy on challenging datasets such as VOT2020, LaSOT, TrackingNet, and GOT-10k.

Abstract

Efficiently modeling spatio-temporal relations of objects is a key challenge in visual object tracking (VOT). Existing methods track by appearance-based similarity or long-term relation modeling, so the rich temporal context between consecutive frames is easily overlooked. Moreover, training trackers from scratch or fine-tuning large pre-trained models consumes considerable time and memory. In this paper, we present ACTrack, a new tracking framework with additive spatio-temporal conditions. It preserves the quality and capabilities of the pre-trained Transformer backbone by freezing its parameters, and adds a trainable lightweight additive net to model spatio-temporal relations in tracking. We design an additive Siamese convolutional network to ensure the integrity of spatial features and perform temporal sequence modeling to simplify the tracking pipeline. Experimental results on several benchmarks show that ACTrack balances training efficiency and tracking performance.
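
The frozen-backbone-plus-additive-net idea can be illustrated with a toy sketch. This is not the authors' implementation: the `Linear` class, the `forward` and `sgd_step` helpers, the squared-error loss, and the zero initialization of the additive branch are all assumptions made for illustration. The sketch only shows that a frozen branch and a trainable branch can be summed, and that a training step leaves the frozen parameters untouched.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """A tiny linear map y = W x, with a flag marking whether it may be trained."""
    weight: list            # list of rows, each a list of floats
    trainable: bool = True

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


def forward(backbone, additive, x):
    """Sum the frozen backbone output and the additive branch output."""
    return [b + a for b, a in zip(backbone(x), additive(x))]


def sgd_step(layers, x, target, lr=0.1):
    """One squared-error gradient step that skips every frozen layer."""
    backbone, additive = layers
    pred = forward(backbone, additive, x)
    err = [p - t for p, t in zip(pred, target)]
    for layer in layers:
        if not layer.trainable:
            continue  # frozen: parameters are left untouched
        for i, row in enumerate(layer.weight):
            for j in range(len(row)):
                # For a linear layer, d(0.5 * sum(err^2)) / dW[i][j] = err[i] * x[j]
                row[j] -= lr * err[i] * x[j]
    return 0.5 * sum(e * e for e in err)


# Hypothetical stand-ins: an identity "pre-trained" backbone (frozen) and a
# zero-initialized additive branch (an assumption, so training starts from
# the backbone's own output).
backbone = Linear([[1.0, 0.0], [0.0, 1.0]], trainable=False)
additive = Linear([[0.0, 0.0], [0.0, 0.0]])

x, target = [1.0, 2.0], [2.0, 2.0]
losses = [sgd_step((backbone, additive), x, target) for _ in range(20)]
```

After the loop, `backbone.weight` is unchanged while `additive.weight` has absorbed the correction, so only the small branch carries gradient updates and optimizer state. That is the general mechanism by which freezing a large backbone can reduce training cost.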

Paper Structure (3 sections, 2 figures)

This paper contains 3 sections, 2 figures.

Figures (2)

  • Figure 1: Comparison of tracking frameworks. (a) A framework with three components: a CNN backbone, a feature integration module, and task-specific heads. (b) A framework with a trainable Transformer backbone and task-specific heads. (c) Our ACTrack freezes the pre-trained Transformer and trains a lightweight additive conditional net to model spatio-temporal relations.
  • Figure 2: The overall architecture of ACTrack. The parameters of the pre-trained Transformer backbone are frozen, and a trainable additive net extracts generic features of the template and search images. The track sequence is transferred from the previous frame and concatenated with the generic features to generate spatio-temporal conditional queries. The tracked object's coordinates are predicted frame-by-frame through temporal sequence modeling.
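
The frame-by-frame flow described in Figure 2 can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's model: `build_query`, `track`, `toy_predictor`, the `history_len` parameter, and the `(x, y, w, h)` box format are all hypothetical, and the toy predictor merely shifts the most recent box by a per-frame offset. The sketch shows only the data flow: the track sequence from previous frames is concatenated with the current features to form a query, and each prediction is appended to the sequence for the next frame.

```python
from collections import deque


def build_query(features, track_sequence):
    """Concatenate current-frame features with the flattened box history."""
    flat_history = [c for box in track_sequence for c in box]
    return list(features) + flat_history


def track(frame_features, init_box, predict_box, history_len=3):
    """Predict one box per frame, carrying recent boxes forward as context."""
    history = deque([init_box], maxlen=history_len)  # bounded track sequence
    results = []
    for feats in frame_features:
        query = build_query(feats, history)
        box = predict_box(query)
        results.append(box)
        history.append(box)  # the new box conditions the next frame
    return results


def toy_predictor(query):
    """Hypothetical stand-in for a model: shift the latest box by (dx, dy)."""
    dx, dy = query[0], query[1]   # toy "features" are a 2-D offset
    x, y, w, h = query[-4:]       # the most recent box sits at the end
    return (x + dx, y + dy, w, h)


# Three frames whose toy features encode the object's motion.
boxes = track(
    frame_features=[(1.0, 0.0), (1.0, 0.0), (0.0, 2.0)],
    init_box=(10.0, 20.0, 5.0, 5.0),
    predict_box=toy_predictor,
)
```

Here `boxes` holds one predicted box per frame. In a real tracker, `predict_box` would be the learned model and the features would come from the template and search images.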