Breaking Modality Gap in RGBT Tracking: Coupled Knowledge Distillation

Andong Lu, Jiacong Zhao, Chenglong Li, Yun Xiao, Bin Luo

TL;DR

This work proposes Coupled Knowledge Distillation (CKD), a framework for high-performance RGBT tracking that breaks the modality gap by pursuing styles common to both modalities. It introduces two student networks and applies a style distillation loss to make their style features as consistent as possible.

Abstract

The modality gap between RGB and thermal infrared (TIR) images is a crucial but often overlooked issue in existing RGBT tracking methods. We observe that the modality gap mainly lies in the image style difference. In this work, we propose a novel Coupled Knowledge Distillation framework called CKD, which pursues common styles across modalities to break the modality gap for high-performance RGBT tracking. In particular, we introduce two student networks and employ a style distillation loss to make their style features as consistent as possible. By alleviating the style difference between the two student networks, we can effectively break the modality gap. However, distilling style features might harm the content representations of the two modalities in the student networks. To handle this issue, we take the original RGB and TIR networks as teachers and distill their content knowledge into the two student networks, respectively, through a style-content orthogonal feature decoupling scheme. We couple these two distillation processes in an online optimization framework to form new feature representations of the RGB and thermal modalities without a modality gap. In addition, we incorporate a masked modeling strategy and a multi-modal candidate token elimination strategy into CKD to improve tracking robustness and efficiency, respectively. Extensive experiments on five standard RGBT tracking datasets validate the effectiveness of the proposed method against state-of-the-art methods while achieving the fastest tracking speed of 96.4 FPS. Code is available at https://github.com/Multi-Modality-Tracking/CKD.
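
To make the notion of "style" concrete, the following is a minimal sketch and not the authors' implementation. It assumes, as is common in style-transfer work, that style is captured by per-channel mean and standard deviation of a feature map, and that instance normalization discards those statistics (the operation Figure 2 uses to remove style information). The function names `style_stats`, `instance_norm`, and `style_gap`, the toy features, and the squared-difference gap measure are all hypothetical choices for illustration; the abstract does not specify the exact form of the style distillation loss.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # small constant to avoid division by zero (assumed value)

def style_stats(feature):
    """Per-channel (mean, std). `feature` is a list of channels,
    each channel a flat list of floats. Assumption: these statistics
    stand in for 'style'; the paper may define style differently."""
    return [(fmean(ch), pstdev(ch)) for ch in feature]

def instance_norm(feature):
    """Normalize each channel to roughly zero mean and unit std,
    which removes the per-channel style statistics."""
    out = []
    for ch in feature:
        mu, sigma = fmean(ch), pstdev(ch)
        out.append([(x - mu) / (sigma + EPS) for x in ch])
    return out

def style_gap(feat_a, feat_b):
    """Hypothetical gap measure: mean squared difference between
    the style statistics of two feature maps."""
    stats_a, stats_b = style_stats(feat_a), style_stats(feat_b)
    total = 0.0
    for (mean_a, std_a), (mean_b, std_b) in zip(stats_a, stats_b):
        total += (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2
    return total / len(stats_a)

# Toy features: identical content pattern, different per-channel scale/shift,
# standing in for RGB and TIR features of the same scene.
content = [[0.0, 1.0, 2.0, 3.0], [1.0, -1.0, 1.0, -1.0]]
rgb_feat = [[2.0 * x + 5.0 for x in ch] for ch in content]
tir_feat = [[0.5 * x - 1.0 for x in ch] for ch in content]

gap_before = style_gap(rgb_feat, tir_feat)
gap_after = style_gap(instance_norm(rgb_feat), instance_norm(tir_feat))
print(f"style gap before IN: {gap_before:.4f}, after IN: {gap_after:.2e}")
```

In this toy setting the two feature maps differ only in per-channel scale and shift, so the gap collapses after normalization. Real RGB and TIR features also differ in content, which is why the method pairs style distillation with content distillation from the teacher networks.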

Paper Structure

This paper contains 30 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of performance and speed for state-of-the-art tracking methods on LasHeR [li2021lasher]. We plot Precision Rate (PR) against Frames Per Second (FPS). CKD ranks first in PR while running at 96.4 FPS.
  • Figure 2: Illustration of the influence of modality style on modality gap. Here, (a) denotes the feature distribution of the two modalities, and (b) denotes the feature distribution of the two modalities after removing the style information using instance normalization.
  • Figure 3: Overall architecture of the proposed CKD. It mainly consists of a four-branch network, three tracking heads, and a coupled distillation framework. The four-branch network extracts visual features from the input video frames and performs style distillation and content distillation in the coupled distillation framework.
  • Figure 4: Attribute-based evaluation on RGBT234 in terms of SR metric. CKD achieves the best performance on all attribute splits. Axes of each attribute have been normalized.
  • Figure 5: Ablation study of loss weights on LasHeR dataset.
  • ...and 1 more figure