Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking

Jiawen Zhu, Huayi Tang, Xin Chen, Xinying Wang, Dong Wang, Huchuan Lu

TL;DR

This work addresses efficient visual tracking on resource-constrained devices by introducing AsymTrack, an asymmetric Siamese tracker that preserves the speed of two-stream architectures while gaining the precision of one-stream designs. It achieves this by computing the template once at initialization and injecting modulation signals into the search branch via an Efficient Template Modulation mechanism, augmented by an Object Perception Enhancement module and a lightweight re-parameterization strategy for inference. The approach yields state-of-the-art speed-precision trade-offs across GPU, CPU, and edge devices, with AsymTrack-B attaining the highest AO on GOT-10k (67.7%) and competitive LaSOT performance, and AsymTrack-T delivering real-time speeds (up to 224 FPS on GPU). These results demonstrate practical applicability for real-world deployment in UAVs and embodied robots, where both latency and accuracy are critical.
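To put the quoted speed in latency terms, the short sketch below converts frames per second into a per-frame time budget. Only the 224 FPS figure comes from the summary above; the 30 FPS camera rate and the function names are assumptions chosen for illustration.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def headroom_ms(tracker_fps: float, camera_fps: float) -> float:
    """Time left per camera frame after the tracker has finished its update."""
    return frame_latency_ms(camera_fps) - frame_latency_ms(tracker_fps)


# 224 FPS is the GPU speed quoted for AsymTrack-T.
gpu_latency = frame_latency_ms(224)

# Assumption: a 30 FPS camera feed, used only to illustrate the budget.
spare = headroom_ms(tracker_fps=224, camera_fps=30)

print(f"per-frame latency: {gpu_latency:.2f} ms, spare budget: {spare:.2f} ms")
```

At 224 FPS the tracker spends roughly 4.5 ms per frame on average, which would leave most of a 30 FPS frame interval free for other work on the device.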

Abstract

Efficient tracking has garnered attention for its ability to operate on resource-constrained platforms, enabling real-world deployment beyond desktop GPUs. Current efficient trackers mainly follow precision-oriented trackers, adopting a one-stream framework with lightweight modules. However, blindly adhering to the one-stream paradigm may not be optimal: recomputing the template in every frame is redundant, and pervasive semantic interaction between the template and the search region places stress on edge devices. In this work, we propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking. AsymTrack disentangles the template and search streams into separate branches, with the template computed only once during initialization to generate modulation signals. Building on this architecture, we devise an efficient template modulation mechanism to unidirectionally inject crucial cues into the search features, and design an object perception enhancement module that integrates abstract semantics and local details to overcome the limited representation of lightweight trackers. Extensive experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms compared to the current state of the art. For instance, AsymTrack-T achieves 60.8% AUC on LaSOT and 224/81/84 FPS on GPU/CPU/AGX, surpassing HiT-Tiny by 6.0% AUC with higher speeds. The code is available at https://github.com/jiawen-zhu/AsymTrack.
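The compute-once, inject-one-way pattern described in the abstract can be illustrated with a toy sketch. This is not the paper's ETM or OPE design: the `encode` stand-in, the mean-pooled prototype, the dot-product scoring, and all class and variable names are assumptions made for illustration. The only point is that the template branch runs once at initialization, while each search frame is processed alone and modulated by the cached signal.

```python
from typing import List, Optional

Features = List[List[float]]  # one feature vector per spatial location


def encode(patch: Features) -> Features:
    """Hypothetical stand-in for a backbone; here it simply copies the input."""
    return [list(vec) for vec in patch]


class ToyAsymmetricTracker:
    def __init__(self) -> None:
        self.prototype: Optional[List[float]] = None
        self.template_passes = 0  # counts how often the template branch runs

    def initialize(self, template: Features) -> None:
        """Template branch: executed once, then only its cached output is used."""
        feats = encode(template)
        self.template_passes += 1
        # Assumption: the modulation signal is a mean-pooled channel vector.
        self.prototype = [sum(ch) / len(feats) for ch in zip(*feats)]

    def track(self, search: Features) -> int:
        """Search branch: runs per frame and never updates the template side."""
        if self.prototype is None:
            raise RuntimeError("call initialize() before track()")
        feats = encode(search)
        # One-way injection: weight each search channel by the cached prototype,
        # then sum the channels to get one response per location.
        scores = [sum(f * p for f, p in zip(vec, self.prototype)) for vec in feats]
        # Return the index of the location with the strongest response.
        return max(range(len(scores)), key=scores.__getitem__)


tracker = ToyAsymmetricTracker()
tracker.initialize([[1.0, 0.0], [0.8, 0.2]])  # template seen once

frames = [
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    [[0.9, 0.1], [0.1, 0.9]],
]
best = [tracker.track(frame) for frame in frames]
print(best, tracker.template_passes)
```

However many frames are tracked, `template_passes` stays at 1. A one-stream design would instead process the template together with every search frame, which is the redundancy the abstract points to.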

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: AsymTrack vs. other frameworks and trackers. (a)-(c) show the Siamese (two-stream) network, the one-stream network, and our asymmetric Siamese network, respectively. The colors distinguish networks in the initialization and inference phases. Diagrams (d) and (e) compare speed-precision trade-offs on CPU and Jetson AGX Xavier platforms. Circle area represents the number of parameters in (d) and the FLOPs in (e).
  • Figure 2: Overview of AsymTrack. It employs an asymmetric Siamese pipeline, where the template branch runs once during initialization, generating features and a prototype that are unidirectionally fed to the search region branch for online inference.
  • Figure 3: Efficient template modulation (ETM) mechanism.
  • Figure 4: Object perception enhancement (OPE) module.
  • Figure 5: VOT real-time testing on Jetson AGX Xavier.
  • ...and 2 more figures