Multi-attention Associate Prediction Network for Visual Tracking

Xinglong Sun, Haijiang Sun, Shan Jiang, Jiacheng Wang, Xilai Wei, Zhonghe Hu

TL;DR

The paper describes a Siamese tracker built upon the proposed prediction network, which achieves leading performance on five tracking benchmarks (LaSOT, TrackingNet, GOT-10k, TNL2k and UAV123) and surpasses other state-of-the-art approaches.

Abstract

Classification-regression prediction networks have achieved impressive success in several modern deep trackers. However, classification and regression are inherently different tasks, so they place diverse, even opposite, demands on feature matching. Existing models ignore this key issue and employ a single unified matching block in both task branches, which degrades decision quality. These models also struggle with decision misalignment. In this paper, we propose a multi-attention associate prediction network (MAPNet) to tackle the above problems. Concretely, two novel matchers, i.e., a category-aware matcher and a spatial-aware matcher, are first designed for feature comparison by organically integrating self, cross, channel or spatial attention. They are capable of fully capturing the category-related semantics for classification and the local spatial contexts for regression, respectively. Then, we present a dual alignment module to enhance the correspondences between the two branches, which helps find the optimal tracking solution. Finally, we describe a Siamese tracker built upon the proposed prediction network, which achieves leading performance on five tracking benchmarks (LaSOT, TrackingNet, GOT-10k, TNL2k and UAV123) and surpasses other state-of-the-art approaches.
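
The idea of using a different attention mix per branch can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the parameter-free sigmoid gates, the omission of self-attention and the order of operations are all assumptions made for illustration. It only shows how cross attention between search and template tokens might be followed by a channel gate in one branch and a spatial gate in the other.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(search, template):
    # Scaled dot-product attention: each search token (query) attends
    # over all template tokens (keys and values). No learned projections.
    d = len(template[0])
    out = []
    for q in search:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in template]
        w = softmax(scores)
        out.append([sum(w[j] * template[j][c] for j in range(len(template)))
                    for c in range(d)])
    return out

def channel_gate(tokens):
    # Hypothetical stand-in for channel attention: scale each channel
    # by a sigmoid of its mean over all tokens.
    n, d = len(tokens), len(tokens[0])
    gates = [1.0 / (1.0 + math.exp(-sum(t[c] for t in tokens) / n)) for c in range(d)]
    return [[t[c] * gates[c] for c in range(d)] for t in tokens]

def spatial_gate(tokens):
    # Hypothetical stand-in for spatial attention: scale each token
    # (spatial position) by a sigmoid of its mean over channels.
    d = len(tokens[0])
    gates = [1.0 / (1.0 + math.exp(-sum(t) / d)) for t in tokens]
    return [[v * g for v in t] for t, g in zip(tokens, gates)]

def toy_category_matcher(search, template):
    # Assumed composition for the classification branch.
    return channel_gate(cross_attention(search, template))

def toy_spatial_matcher(search, template):
    # Assumed composition for the regression branch.
    return spatial_gate(cross_attention(search, template))

# Three search tokens and two template tokens, two channels each.
search = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
template = [[1.0, 0.0], [0.0, 1.0]]
cls_map = toy_category_matcher(search, template)
reg_map = toy_spatial_matcher(search, template)
```

Both branches start from the same matched features, and only the gating differs. The sketch keeps that split visible, whereas the real matchers presumably use learned projections and multi-head attention.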

Paper Structure (40 sections, 13 equations, 10 figures, 5 tables)

This paper contains 40 sections, 13 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Classification and regression similarity maps produced by MAPNet. The prediction network extracts more category-related responses for classification and more local texture information for localization.
  • Figure 2: Overview of the proposed prediction network, consisting of category-aware matchers, spatial-aware matchers and a dual alignment module. Ch-Attn, Sp-Attn, Sf-Attn and Cs-Attn denote channel, spatial, self and cross attention, respectively. The template and search-region features are first compared by the two kinds of matchers, and the resulting two kinds of similarity maps are then aligned by the dual alignment module.
  • Figure 3: Architecture of the proposed category-aware matcher, which combines self, cross and channel attention.
  • Figure 4: Pipeline of the Siamese tracker based on the proposed prediction network, which consists of a backbone, the prediction network and prediction heads.
  • Figure 5: Success and normalized precision plots of all trackers under one-pass evaluation (OPE) on LaSOT. Trackers are ranked by their performance scores.
  • ...and 5 more figures