Table of Contents
Fetching ...

VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

Jun Du

Abstract

Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.

VSD-MOT: End-to-End Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Distillation

Abstract

Existing multi-object tracking algorithms typically fail to adequately address the issues in low-quality videos, resulting in a significant decline in tracking performance when image quality deteriorates in real-world scenarios. This performance degradation is primarily due to the algorithms' inability to effectively tackle the problems caused by information loss in low-quality images. To address the challenges of low-quality video scenarios, inspired by vision-language models, we propose a multi-object tracking framework guided by visual semantic distillation (VSD-MOT). Specifically, we introduce the CLIP Image Encoder to extract global visual semantic information from images to compensate for the loss of information in low-quality images. However, direct integration can substantially impact the efficiency of the multi-object tracking algorithm. Therefore, this paper proposes to extract visual semantic information from images through knowledge distillation. This method adopts a teacher-student learning framework, with the CLIP Image Encoder serving as the teacher model. To enable the student model to acquire the capability of extracting visual semantic information suitable for multi-object tracking tasks from the teacher model, we have designed the Dual-Constraint Semantic Distillation method (DCSD). Furthermore, to address the dynamic variation of frame quality in low-quality videos, we propose the Dynamic Semantic Weight Regulation (DSWR) module, which adaptively allocates fusion weights based on real-time frame quality assessment. Extensive experiments demonstrate the effectiveness and superiority of the proposed method in low-quality video scenarios in the real world. Meanwhile, our method can maintain good performance in conventional scenarios.
Paper Structure (20 sections, 17 equations, 7 figures, 8 tables)

This paper contains 20 sections, 17 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison between (a) the direct integration of the CLIP Image Encoder process and (b) our proposed visual semantic distillation tracking process. The direct integration approach significantly impacts the efficiency of multi-object tracking algorithms. The visual semantic distillation tracking employs a teacher-student learning framework, adding minimal parameters to learn invariant visual semantic information, thereby enhancing tracking performance while ensuring algorithmic efficiency.
  • Figure 2: The overall architecture of VSD-MOT. The frozen CLIP Image Encoder extracts global visual semantic information from images, and proposals generated by the detector YOLOX are used to produce proposal queries. Tracking inquiries are transferred from the preceding frame for the purpose of forecasting the bounding boxes of tracked objects. The combination of proposal queries and track queries generates query vectors. Through the Dual-Constraint Semantic Distillation (DCSD) method, the student model can learn from the CLIP Image Encoder the capability to extract semantic information that is adaptive to multi-object tracking tasks. The semantic information extracted by the student model is integrated with query vectors through the Dynamic Semantic Weight Regulation (DSWR) module to generate adaptively fused vectors. Both the fused vectors and the image features are fed into the tracker to produce predictions frame by frame.
  • Figure 3: Structure of Dual-Constrained Semantic Distillation. To enable the student model to better extract semantic information adapted to the multi-object tracking task, the core of this module implements a dual-constraint mechanism via two complementary losses, which balances local feature matching and global semantic consistency.
  • Figure 4: Structure of student model. It employs a Transformer Encoder architecture to process the query vectors of the multi-object tracking algorithm.
  • Figure 5: Network architecture of Dynamic Semantic Weight Regulation Module. The module evaluates frame quality through clarity, noise level, and contrast metrics, then generates adaptive fusion weights for combining visual semantic features with query vector features.
  • ...and 2 more figures