Coordinate-Aware Thermal Infrared Tracking Via Natural Language Modeling

Miao Yan, Ping Zhang, Haofei Zhang, Ruqian Hao, Juanxiu Liu, Xiaoyang Wang, Lin Liu

TL;DR

This work tackles robust thermal infrared (TIR) tracking under low-texture conditions by reframing the task as coordinate sequence generation within a natural language modeling framework. The proposed NLMTrack architecture combines a Transformer-based encoder that unifies feature extraction and fusion, a multilevel progressive fusion module that supplies multi-scale semantics, and a causal Transformer decoder that autoregressively outputs the coordinate tokens $[cmd, x, y, w, h]$. Each coordinate is discretized into $nbins$ bins, and all coordinates share one vocabulary of size $|V| = nbins$. The model is trained with a joint cross-entropy and SIoU loss and uses a simple dynamic template update strategy guided by token-level confidence. It achieves state-of-the-art results on the VOT-TIR2015, PTB-TIR, and LSOTB-TIR benchmarks. The results indicate that integrating temporal and coordinate information in a unified framework improves robustness to occlusion, scale variation, and background interference, with practical implications for real-world TIR tracking and potential multi-task extensions.
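The discretization step described above can be illustrated with a minimal sketch. It assumes box coordinates are already normalized to $[0, 1]$ and uses uniform bins; the function names `box_to_tokens` and `tokens_to_box` are hypothetical, and how the $cmd$ token is indexed is not specified in this summary, so it is left out.

```python
from typing import List, Sequence


def box_to_tokens(box: Sequence[float], nbins: int) -> List[int]:
    """Quantize normalized [x, y, w, h] values into integer tokens in [0, nbins - 1].

    Assumption: uniform bins over [0, 1]; values outside that range are clamped.
    """
    tokens = []
    for value in box:
        clamped = min(max(value, 0.0), 1.0)
        # int() floors for non-negative inputs; the min() keeps 1.0 inside the last bin.
        tokens.append(min(int(clamped * nbins), nbins - 1))
    return tokens


def tokens_to_box(tokens: Sequence[int], nbins: int) -> List[float]:
    """Map tokens back to normalized coordinates using each bin's center."""
    return [(token + 0.5) / nbins for token in tokens]


if __name__ == "__main__":
    nbins = 1000  # hypothetical value chosen for illustration
    box = [0.0, 0.5, 0.25, 1.0]
    tokens = box_to_tokens(box, nbins)
    print(tokens)
    print(tokens_to_box(tokens, nbins))
```

With uniform bins and bin-center decoding, the round-trip error per coordinate is at most half a bin width, $1/(2 \cdot nbins)$, which is one reason the choice of $nbins$ trades vocabulary size against localization precision.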

Abstract

Thermal infrared (TIR) tracking is pivotal in computer vision tasks due to its all-weather imaging capability. Traditional tracking methods predominantly rely on hand-crafted features, and while deep learning has introduced correlation filtering techniques, these are often constrained by rudimentary correlation operations. Furthermore, transformer-based approaches tend to overlook temporal and coordinate information, which is critical for TIR tracking because TIR images lack texture and color information. To address these issues, we apply natural language modeling to TIR tracking and propose a coordinate-aware thermal infrared tracking model called NLMTrack, which enhances the utilization of coordinate and temporal information. NLMTrack applies an encoder that unifies feature extraction and feature fusion, which simplifies the TIR tracking pipeline. To address the low detail and low contrast of TIR images, we design a multi-level progressive fusion module that enhances the semantic representation and incorporates multi-scale features. In addition, the decoder combines the TIR features and the coordinate sequence features using a causal transformer to generate the target sequence step by step. Moreover, we explore an adaptive loss aimed at improving tracking accuracy and a simple template update strategy to accommodate variations in the target's appearance. Experiments show that NLMTrack achieves state-of-the-art performance on multiple benchmarks. The code is publicly available at https://github.com/ELOESZHANG/NLMTrack.
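The step-by-step generation and the confidence-guided template update can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `step_fn` stands in for the causal decoder, decoding is greedy, confidence is taken as the softmax probability of each chosen token, and both the mean aggregation and the `threshold` value are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]], num_coords: int = 4
) -> Tuple[List[int], List[float]]:
    """Generate coordinate tokens one at a time, each conditioned on the prefix.

    step_fn is a stand-in for the decoder: it receives the tokens generated
    so far and returns logits over the coordinate vocabulary.
    """
    tokens: List[int] = []
    confidences: List[float] = []
    for _ in range(num_coords):
        probs = softmax(step_fn(tokens))
        best = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(best)
        confidences.append(probs[best])
    return tokens, confidences


def should_update_template(confidences: Sequence[float], threshold: float = 0.6) -> bool:
    """Hypothetical rule: update the template when mean token confidence is high."""
    return sum(confidences) / len(confidences) >= threshold


def toy_step(prefix: List[int], nbins: int = 8) -> List[float]:
    """Toy decoder: puts a strong logit on bin len(prefix) + 1."""
    target = len(prefix) + 1
    return [5.0 if i == target else 0.0 for i in range(nbins)]


if __name__ == "__main__":
    tokens, confidences = greedy_decode(toy_step)
    print(tokens, should_update_template(confidences))
```

Because each step sees only the tokens already generated, the loop mirrors what a causal attention mask enforces during training, where later coordinates are hidden from earlier positions.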

Paper Structure (29 sections, 6 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Our NLMTrack framework. The overall framework consists of an encoder, a multilevel progressive fusion module, and a decoder with a causal Transformer. $Z_p$ and $X_p$ refer to TIR features extracted from the template and the search region, respectively. $f_x$ denotes the search region features from the encoder’s output.
  • Figure 2: Detailed architecture of the multilevel progressive fusion module. The module generates features at different levels through a simple feature pyramid and progressively fuses cross-semantic features using our fusion modules UpFusion and DownFusion.
  • Figure 3: Comparison of commonly used fusion structures with our proposed method. (a) Fusion by concatenation; (b) Fusion by element-wise addition; (c) Our progressive fusion approach.
  • Figure 4: Attribute-based evaluation of NLMTrack on PTB-TIR benchmark.
  • Figure 5: Attribute-based evaluation of NLMTrack on LSOTB-TIR benchmark.
  • ...and 1 more figure