Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance

Liting Lin, Heng Fan, Zhipeng Zhang, Yaowei Wang, Yong Xu, Haibin Ling

TL;DR

This work tackles the high resource costs of training large vision transformers for visual tracking by importing Low-Rank Adaptation (LoRA) into a one-stream ViT-based tracker, yielding substantial efficiency without sacrificing performance. The key innovations are decoupled input embeddings (shared spatial plus token-type and foreground-indication embeddings) and an MLP-only head, enabling parameter-efficient fine-tuning with a low-rank update $\Delta \Theta \approx BA$ and minimal inference overhead. Empirically, LoRAT achieves state-of-the-art SUC on LaSOT (up to 0.762) and strong results on TrackingNet and GOT-10k, while delivering high FPS (up to 209 fps) and feasible training times on modest hardware. The approach demonstrates that large-scale pre-trained ViTs can be effectively leveraged for tracking under resource constraints, broadening accessibility and accelerating progress in the field.
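The low-rank update $\Delta \Theta \approx BA$ and the claim of minimal inference overhead can be illustrated with a minimal NumPy sketch. The matrix sizes, the rank, and the `scale` parameter are illustrative assumptions and are not taken from the paper. Initializing `B` to zero follows the standard LoRA recipe. The sketch shows that the trained factors can be folded into the frozen weight, so inference uses a single matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2                     # illustrative sizes, not the paper's
W = rng.standard_normal((d_out, d_in))          # frozen pre-trained weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init: update starts at 0


def lora_forward(x, W, A, B, scale=1.0):
    """Training-time forward pass: frozen path plus low-rank path."""
    return x @ W.T + scale * (x @ A.T) @ B.T


def merge(W, A, B, scale=1.0):
    """Fold the low-rank update B @ A into W for inference."""
    return W + scale * (B @ A)


x = rng.standard_normal((4, d_in))
y_init = lora_forward(x, W, A, B)               # equals the frozen model at init

B = rng.standard_normal((d_out, rank))          # stand-in for a trained B
y_train = lora_forward(x, W, A, B)              # two-path forward
y_merged = x @ merge(W, A, B).T                 # single matmul after merging

trainable = A.size + B.size                     # parameters updated by LoRA
full = W.size                                   # parameters in the full weight
```

With these toy sizes the saving is small (32 trainable values against 64), but the trainable count grows linearly in the layer width while the full weight grows quadratically, which is where the memory savings on large ViT backbones come from.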

Abstract

Motivated by Parameter-Efficient Fine-Tuning (PEFT) in large language models, we propose LoRAT, a method that unveils the power of large ViT models for tracking within laboratory-level resources. The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency, to the domain of visual tracking. However, unique challenges and potential domain gaps make this transfer less straightforward than it first appears. First, a transformer-based tracker constructs unshared position embeddings for the template and search images. This poses a challenge for transferring LoRA to downstream tasks, because LoRA usually requires the design to stay consistent with the pre-trained backbone. Second, the inductive bias inherent in convolutional heads diminishes the effectiveness of parameter-efficient fine-tuning in tracking models. To overcome these limitations, we first decouple the position embeddings in transformer-based trackers into shared spatial ones and independent type ones. The shared embeddings, which describe the absolute coordinates of multi-resolution images (namely, the template and search images), are inherited from the pre-trained backbones. In contrast, the independent embeddings indicate the source of each token and are learned from scratch. Furthermore, we design an anchor-free head solely based on MLP to adapt PETR, enabling better performance with less computational overhead. With our design, 1) it becomes practical to train trackers with the ViT-g backbone on GPUs with only 25.8GB of memory (batch size of 16); 2) we reduce the training time of the L-224 variant from 35.0 to 10.8 GPU hours; 3) we improve the LaSOT SUC score from 0.703 to 0.742 with the L-224 variant; 4) we increase the inference speed of the L-224 variant from 52 to 119 FPS. Code and models are available at https://github.com/LitingLin/LoRAT.
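The decoupling of position embeddings described in the abstract can be sketched as follows. This is a toy NumPy illustration, not the released implementation: the grid sizes and embedding width are made up, the "pre-trained" spatial grid is random, and letting the smaller template grid read the top-left corner of the shared grid is an assumption, since the abstract does not say how the two resolutions index the shared embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
search_grid, template_grid = 4, 2   # patches per side; illustrative only

# Shared spatial embedding: one 2D grid, inherited from the pre-trained
# ViT in the real model (random values stand in for it here).
pos_embed = rng.standard_normal((search_grid, search_grid, dim))

# Independent type embeddings, learned from scratch.
# Index 0 marks template tokens, index 1 marks search tokens.
type_embed = rng.standard_normal((2, dim))


def build_input(patches, grid, token_type):
    """Add the shared spatial embedding and a per-source type embedding."""
    # Assumption: a grid smaller than the shared one uses its top-left crop.
    pos = pos_embed[:grid, :grid].reshape(grid * grid, dim)
    return patches + pos + type_embed[token_type]


template = rng.standard_normal((template_grid ** 2, dim))  # patch embeddings
search = rng.standard_normal((search_grid ** 2, dim))

# One joint token sequence for the one-stream encoder.
tokens = np.concatenate([
    build_input(template, template_grid, 0),
    build_input(search, search_grid, 1),
])
```

In this sketch the first template token and the first search token receive the same spatial vector and differ only in their type vector, which is what lets the spatial part stay identical to the pre-trained backbone while the source information is learned separately.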

Paper Structure (27 sections, 6 equations, 2 figures, 16 tables)

This paper contains 27 sections, 6 equations, 2 figures, 16 tables.

Figures (2)

  • Figure 1: Comparison of tracking models on performance and training efficiency. "$\times$" indicates failure to train due to insufficient memory. Best viewed in color for all figures.
  • Figure 2: Architecture of LoRAT. The template and search region are first split into patches and then projected into patch embeddings. Shared position embeddings and token type embeddings are added to the patch embeddings to form the input embeddings, which are then fed into the Transformer encoder for joint feature extraction and fusion. The resulting representations are fed to the MLP-only head network for target classification and anchor-free bounding box regression. Most network components from the pre-trained ViT model are frozen during training, except for the LoRA modules applied to the linear layers in the Transformer encoder, the token type embeddings, and the head network.
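
The MLP-only head in Figure 2 can be pictured with a small NumPy sketch. The caption does not give the box parameterization, so this example assumes a common anchor-free form in which each search token predicts normalized distances (left, top, right, bottom) from its patch center. The layer sizes, the random weights, and the helper names `mlp` and `init` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, grid = 16, 32, 4       # illustrative sizes, not the paper's


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied to every token independently."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def init(n_out):
    """Random weights standing in for a trained head branch."""
    return (rng.standard_normal((dim, hidden)) * 0.1, np.zeros(hidden),
            rng.standard_normal((hidden, n_out)) * 0.1, np.zeros(n_out))


cls_params, box_params = init(1), init(4)

# Encoder output for the search-region tokens (random stand-in).
search_tokens = rng.standard_normal((grid * grid, dim))

# Classification branch: one target score per token.
scores = mlp(search_tokens, *cls_params)[:, 0]

# Regression branch: sigmoid keeps the four distances in (0, 1).
ltrb = 1.0 / (1.0 + np.exp(-mlp(search_tokens, *box_params)))

# Decode the box at the highest-scoring token.
best = int(np.argmax(scores))
row, col = divmod(best, grid)
cx, cy = (col + 0.5) / grid, (row + 0.5) / grid   # patch center, normalized
l, t, r, b = ltrb[best]
box = (cx - l, cy - t, cx + r, cy + b)            # (x1, y1, x2, y2)
```

Because both branches act on each token separately, the head carries no convolutional prior over neighboring patches. The decoded box here is not clipped to the image bounds.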