IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking

Run Luo, Zikai Song, Longze Chen, Yunshui Li, Min Yang, Wei Yang

TL;DR

IP-MOT is an end-to-end transformer model for MOT that operates without concrete textual descriptions. It achieves competitive performance on same-domain data compared to state-of-the-art models and improves the performance of query-based trackers by large margins on cross-domain inputs.

Abstract

Multi-Object Tracking (MOT) aims to associate multiple objects across video frames and is a challenging vision task due to inherent complexities in the tracking environment. Most existing approaches train and track within a single domain, so they generalize poorly to data from other domains. While several works have introduced natural language representations to bridge the domain gap in visual tracking, these textual descriptions are often too high-level and fail to distinguish between instances of the same class. In this paper, we address this limitation by developing IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions. Our approach is underpinned by two key innovations: first, leveraging a pre-trained vision-language model, we obtain instance-level pseudo textual descriptions via prompt-tuning, which are invariant across different tracking scenes; second, we introduce a query-balanced strategy, augmented by knowledge distillation, to further boost the generalization capabilities of our model. Extensive experiments on three widely used MOT benchmarks (MOT17, MOT20, and DanceTrack) demonstrate that our approach not only achieves competitive performance on same-domain data compared to state-of-the-art models but also improves the performance of query-based trackers by large margins on cross-domain inputs.

Paper Structure

This paper contains 16 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: IP-MOT. We propose IP-MOT, which further improves the generalization ability of the model by using an online learnable TrackBook instead of a manually designed TrackBook to obtain a more fine-grained instance-level textual description. Meanwhile, a query balanced strategy (QBS) is also proposed to further improve the tracking and detection accuracy of IP-MOT for cross-domain and same-domain inputs.
  • Figure 2: The overall architecture of IP-MOT. We use different colors to indicate different tracked targets, and the same color represents the same target. In each iteration, we first optimize our trainable TrackBook to obtain an instance-level textual description based on the target in a clip of the video stream. Then, we adopt a ResNet-50 backbone and a Transformer encoder to learn a 2D representation of the input image. Afterward, the decoder processes the detect query $Q_{det}$ and the track query $Q_{tck}$, and generates the detect output embedding $O_{det}$ and the track output embedding $O_{tck}$, respectively. Finally, we add the output embeddings to the clip-level embedding pool and align them with the corresponding frozen textual description representation. Since the query balanced strategy (QBS) is used to alleviate unfair label assignment conflicts, we design a simple and elegant deduplication module (DEM) to deduplicate detection results.
  • Figure 3: The structure of the deduplication module. In the inference stage, we restrain only the deduplicated objects, computing the tracking score as the geometric mean of the classification score and the deduplication score. The subsequent QIM then keeps newborn objects and drops exited objects based on the tracking score.
  • Figure 4: Visualization of the track output embedding $O_{tck}$ (the first 50 frames of sequence MOT20-02 on the cross-domain benchmark) using t-Distributed Stochastic Neighbor Embedding (t-SNE). Embeddings for different targets are marked in different colors and shapes. Our method (\ref{fig:vis-IP-MOT}) helps the model learn a more stable and distinguishable representation of the track output embedding than MOTR (\ref{fig:vis-motr}). The corresponding tracking performance is shown in Table \ref{tab:align-qbs}. The visualizations in \ref{fig:vis-dem-mot17} and \ref{fig:vis-dem-dance} show that the IP-MOT track query box predictions highly overlap the detect query box predictions on the same-domain MOT17 and DanceTrack test sets, respectively, and the corresponding query self-attention maps show a clear exchange of information between the deduplicated detect query and the track query of the same instance, demonstrating the effectiveness of our DEM.
  • Figure 5: Visualization of the instance-level textual description. Since Stable Diffusion and CLIP share the same text encoder, we can generate the corresponding image from the instance-level textual description. Target* denotes the original target in the MOT17 dataset, while target, color, and texture denote the corresponding synthetic images obtained by replacing the last word of the textual description with "person", "dog", and "cat", respectively.
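
The Figure 3 caption describes the tracking score as the geometric mean of the classification score and the deduplication score, i.e. $s_{trk} = \sqrt{s_{cls} \cdot s_{dedup}}$ (the symbol names are ours, not the paper's). The sketch below illustrates that arithmetic and a simplified keep/drop rule. The function names, the dictionary layout, and both thresholds are hypothetical, and for simplicity the score is applied to every candidate; this is not the authors' implementation of DEM or QIM.

```python
import math


def tracking_score(cls_score: float, dedup_score: float) -> float:
    """Geometric mean of a classification score and a deduplication score.

    Both inputs are assumed to be probabilities in [0, 1].
    """
    if not (0.0 <= cls_score <= 1.0 and 0.0 <= dedup_score <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return math.sqrt(cls_score * dedup_score)


def filter_candidates(candidates, newborn_thresh=0.5, exit_thresh=0.3):
    """Keep or drop candidates based on their tracking score.

    Hypothetical simplification: a newborn object is kept when its score
    reaches `newborn_thresh`; an existing track is kept when its score
    reaches `exit_thresh`. Both threshold values are illustrative only.
    """
    kept = []
    for cand in candidates:
        score = tracking_score(cand["cls"], cand["dedup"])
        threshold = newborn_thresh if cand["is_new"] else exit_thresh
        if score >= threshold:
            kept.append({**cand, "score": score})
    return kept


candidates = [
    {"id": 1, "is_new": True, "cls": 0.9, "dedup": 0.9},    # confident newborn
    {"id": 2, "is_new": True, "cls": 0.9, "dedup": 0.04},   # likely a duplicate
    {"id": 3, "is_new": False, "cls": 0.4, "dedup": 0.4},   # existing track, kept
    {"id": 4, "is_new": False, "cls": 0.1, "dedup": 0.5},   # existing track, exits
]
kept_ids = [c["id"] for c in filter_candidates(candidates)]
print(kept_ids)
```

Because the geometric mean is pulled toward the smaller of its two inputs, a detection with a high classification score but a low deduplication score (candidate 2 above) still receives a low tracking score, which is the behavior one would want when suppressing duplicate detections.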