CFTrack: Enhancing Lightweight Visual Tracking through Contrastive Learning and Feature Matching

Juntao Liang, Jun Hou, Weijun Zhang, Yong Wang

TL;DR

CFTrack targets the difficulty of achieving high discriminative accuracy in lightweight visual tracking on resource-constrained devices. It introduces a Contrastive Feature Matching module with an adaptive margin and an adaptive contrastive loss, integrated into a Siamese backbone to improve target-background separation and temporal consistency during inference. Experiments on LaSOT, OTB100, UAV123, and HOOT show that CFTrack achieves strong accuracy at real-time speeds (e.g., 136 fps on the NVIDIA Jetson NX and 368 fps on an RTX 3090) and outperforms many lightweight trackers. The work shows the practical value of combining online contrastive learning with feature matching for robust, efficient tracking in edge environments, and suggests that the module could serve as a plug-and-play component for similar trackers.

Abstract

Achieving both efficiency and strong discriminative ability in lightweight visual tracking is a challenge, especially on mobile and edge devices with limited computational resources. Conventional lightweight trackers often struggle with robustness under occlusion and interference, while deep trackers, when compressed to meet resource constraints, suffer from performance degradation. To address these issues, we introduce CFTrack, a lightweight tracker that integrates contrastive learning and feature matching to enhance discriminative feature representations. CFTrack dynamically assesses target similarity during prediction through a novel contrastive feature matching module optimized with an adaptive contrastive loss, thereby improving tracking accuracy. Extensive experiments on LaSOT, OTB100, and UAV123 show that CFTrack surpasses many state-of-the-art lightweight trackers, operating at 136 frames per second on the NVIDIA Jetson NX platform. Results on the HOOT dataset further demonstrate CFTrack's strong discriminative ability under heavy occlusion.
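
To make the idea of a contrastive loss with an adaptive margin more concrete, the following is a minimal sketch in plain Python. It is not the formulation used in CFTrack, which is not given in this summary. It uses the classic margin-based contrastive loss on cosine distance, and the rule in `adaptive_margin` (widening the margin when the tracker's confidence is low) is a hypothetical choice made purely for illustration. All function and parameter names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def adaptive_margin(base_margin, confidence):
    """Hypothetical rule (an assumption, not from the paper):
    low confidence -> larger margin, so negatives are pushed further away."""
    return base_margin * (2.0 - confidence)


def contrastive_loss(template, candidates, labels, confidence, base_margin=0.5):
    """Classic margin-based contrastive loss on cosine distance d = 1 - sim.

    Target samples (label True) are pulled toward the template: d^2.
    Background samples (label False) are pushed beyond the margin: max(0, m - d)^2.
    """
    margin = adaptive_margin(base_margin, confidence)
    total = 0.0
    for feature, is_target in zip(candidates, labels):
        distance = 1.0 - cosine_similarity(template, feature)
        if is_target:
            total += distance ** 2
        else:
            total += max(0.0, margin - distance) ** 2
    return total / len(candidates)


if __name__ == "__main__":
    template = [1.0, 0.0]
    candidates = [[0.9, 0.1], [0.2, 0.9], [0.8, 0.3]]
    labels = [True, False, False]  # the third entry acts as a distractor
    print(contrastive_loss(template, candidates, labels, confidence=0.9))
    print(contrastive_loss(template, candidates, labels, confidence=0.4))
```

In this sketch, a background feature that looks like the template incurs a larger penalty when confidence is low, which is one plausible way an adaptive margin could sharpen target-background separation.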

Paper Structure

This paper contains 21 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Architecture of the proposed CFTrack framework. The CFTrack framework consists of three key components: (1) a lightweight backbone for feature extraction, (2) correlation blocks for feature fusion, and (3) a prediction head including three branches for bounding box regression, classification, and contrastive feature matching.
  • Figure 2: Qualitative comparison of CFTrack with three other lightweight trackers on UAV123 (zoom in for a better view).
  • Figure 3: Qualitative results under different visibility conditions. Red/blue bounding boxes indicate confidence below/above 0.8.
  • Figure 4: Real-world test results. Tracking results are marked in green; CLE denotes the center location error.
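
The captions for Figures 3 and 4 rely on two simple quantities: a confidence threshold of 0.8 and the center location error (CLE). The sketch below shows how these are commonly computed. It assumes boxes in `(x, y, width, height)` format with `(x, y)` as the top-left corner, which this summary does not specify. The function names are illustrative, and treating a confidence of exactly 0.8 as "below" is also an assumption.

```python
import math


def box_center(box):
    """Center of a box given as (x, y, width, height); the format is an assumption."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def center_location_error(pred_box, gt_box):
    """Euclidean distance in pixels between predicted and ground-truth box centers."""
    px, py = box_center(pred_box)
    gx, gy = box_center(gt_box)
    return math.hypot(px - gx, py - gy)


def box_color(confidence, threshold=0.8):
    """Color convention from the Figure 3 caption: blue above the threshold, red otherwise."""
    return "blue" if confidence > threshold else "red"


if __name__ == "__main__":
    predicted = (100.0, 50.0, 40.0, 60.0)
    ground_truth = (103.0, 54.0, 40.0, 60.0)
    print(center_location_error(predicted, ground_truth))
    print(box_color(0.92), box_color(0.35))
```

A lower CLE means the predicted box is centered closer to the ground truth. It ignores box size, so it is usually reported alongside an overlap-based measure.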