Search Multilayer Perceptron-Based Fusion for Efficient and Accurate Siamese Tracking

Tianqi Shen, Huakao Lin, Ning An

TL;DR

This work redesigns the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead, and introduces a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices.

Abstract

Siamese visual trackers have recently advanced through increasingly sophisticated fusion mechanisms built on convolutional or Transformer architectures. However, both families struggle to deliver pixel-level interactions efficiently on resource-constrained hardware, leading to a persistent accuracy-efficiency imbalance. Motivated by this limitation, we redesign the Siamese neck with a simple yet effective Multilayer Perceptron (MLP)-based fusion module that enables pixel-level interaction with minimal structural overhead. Nevertheless, naively stacking MLP blocks introduces a new challenge: computational cost can scale quadratically with channel width. To overcome this, we construct a hierarchical search space of carefully designed MLP modules and introduce a customized relaxation strategy that enables differentiable neural architecture search (DNAS) to decouple channel-width optimization from other architectural choices. This targeted decoupling automatically balances channel width and depth, yielding a low-complexity architecture. The resulting tracker achieves state-of-the-art accuracy-efficiency trade-offs. It ranks among the top performers on four general-purpose and three aerial tracking benchmarks, while maintaining real-time performance on both resource-constrained Graphics Processing Units (GPUs) and Neural Processing Units (NPUs).
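
The quadratic scaling mentioned in the abstract can be illustrated with a rough cost count. The sketch below is not the paper's cost model: it assumes a generic two-layer channel-mixing MLP (C channels expanded to t·C and projected back) applied independently at every pixel, and it ignores biases, activations, and any token-mixing path. The function name `channel_mlp_macs` and the feature-map size are hypothetical choices made for illustration.

```python
def channel_mlp_macs(channels: int, height: int, width: int, expansion: int = 4) -> int:
    """Multiply-accumulate count for a two-layer channel-mixing MLP.

    Assumed structure (illustrative, not taken from the paper):
    Linear(C -> t*C) followed by Linear(t*C -> C), applied at each pixel.
    """
    hidden = expansion * channels
    # First projection costs C * tC MACs per pixel, the second costs tC * C.
    per_pixel = channels * hidden + hidden * channels
    return height * width * per_pixel


# Doubling the channel width on the same 16x16 feature map.
base = channel_mlp_macs(channels=64, height=16, width=16)
doubled = channel_mlp_macs(channels=128, height=16, width=16)
ratio = doubled / base

print(f"C=64:  {base:,} MACs")
print(f"C=128: {doubled:,} MACs")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the per-pixel cost is 2·t·C², so doubling the width quadruples the MAC count while depth contributes only linearly. This is one way to see why a search procedure might treat channel width separately from other architectural choices.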

Paper Structure (46 sections, 15 equations, 12 figures, 7 tables, 2 algorithms)

Figures (12)

  • Figure 1: Comparison between the conventional pipeline and the proposed pipeline. In subfigure ($\boldsymbol{a}$), the template feature $\boldsymbol{F}^t$ and search feature $\boldsymbol{F}^s$ are processed using manually designed module(s). In subfigure ($\boldsymbol{b}$), the fusion architecture is instead integrated into a searchable space, enabling the discovery of a lightweight yet effective neck design.
  • Figure 2: ($\boldsymbol{a}$) CNN-based fusion uses template feature maps $\boldsymbol{F}^t$ as kernels sliding over search feature maps $\boldsymbol{F}^s$, performing patch-level fusion efficiently on hardware. ($\boldsymbol{b}$) Transformer-based fusion applies cross-attention between template tokens $\boldsymbol{F}^t$ and search tokens $\boldsymbol{F}^s$, achieving pixel-level fusion but with slower hardware execution. ($\boldsymbol{c}$) Our MLP-based fusion first performs coarse fusion on $\boldsymbol{F}^t$ and $\boldsymbol{F}^s$, followed by refinement using Wave-MLP blocks, resulting in hardware-efficient pixel-level fusion. ($\boldsymbol{d}$) The MLP-based approach yields a better accuracy–speed trade-off than its CNN and Transformer counterparts.
  • Figure 3: Overview of our proposed tracker and MCAS. i. Versatile architecture with two backbones: the LightTrack backbone (yan2021lighttrack) for resource-constrained GPU platforms and the AlexNet backbone (krizhevsky2017imagenet) for resource-limited NPU deployment. ii. The primary MLP-based components in the MCAS are the Coarse Fusion MLP (CFM) Module and the Refine Fusion MLP (RFM) Module. CFM performs coarse fusion to generate the heatmap, while RFM refines it to produce the response map. iii. The tracking-head network transforms the response map into classification and regression response maps, which represent the predicted classification score and bounding box of the target.
  • Figure 4: Structure of the harmonization block $H_i$ and its basic layer $C_{i1}$. Top: Basic layers are categorized into CH layers ($C_{i1},\cdots,C_{ik},\cdots,C_{ii}$ with $1\leq k\leq i$) and PH layers ($P_{i1},P_{i2},P_{i3}$). Bottom: $C_{i1}$ consists of 12 Wave-MLP blocks with different configurations. "wavemlp_k5_t4_dw" refers to a Wave-MLP block where the PATM module uses a $5{\times}5$ convolution kernel ("k5"), the MLP expansion ratio is 4 ("t4"), and depthwise separable convolution is applied ("dw"). Each CH or PH layer is a basic layer.
  • Figure 5: Comparative analysis of tracker attributes on VOT2019. The radar chart annotates the minimum and maximum EAO scores per attribute. SEAT_LT attains consistently high EAO scores across attributes—except for illumination change—and surpasses other lightweight GPU-oriented trackers, with particularly strong gains under occlusion.
  • ...and 7 more figures
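
The block-naming convention described in the Figure 4 caption ("wavemlp_k5_t4_dw") can be decoded mechanically. The sketch below is a hypothetical helper, not code from the paper: the names `WaveMLPConfig` and `parse_block_name` are invented for illustration, and it assumes that a name without the `_dw` suffix denotes a block without depthwise separable convolution, which the caption does not state explicitly.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WaveMLPConfig:
    kernel_size: int      # "k5" -> 5x5 convolution kernel in the PATM module
    expansion_ratio: int  # "t4" -> MLP expansion ratio of 4
    depthwise: bool       # "dw" -> depthwise separable convolution is applied


# Assumed grammar: wavemlp_k<int>_t<int> with an optional _dw suffix.
_NAME_PATTERN = re.compile(r"^wavemlp_k(\d+)_t(\d+)(_dw)?$")


def parse_block_name(name: str) -> WaveMLPConfig:
    """Decode a block name such as 'wavemlp_k5_t4_dw' into its settings."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized block name: {name!r}")
    kernel, ratio, dw_suffix = match.groups()
    return WaveMLPConfig(
        kernel_size=int(kernel),
        expansion_ratio=int(ratio),
        depthwise=dw_suffix is not None,
    )


config = parse_block_name("wavemlp_k5_t4_dw")
print(config)
```

The 12 configurations that make up $C_{i1}$ are not listed in the caption, so the sketch only decodes names and does not enumerate the candidate set.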