OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer

Jinyang Li, En Yu, Sijia Chen, Wenbing Tao

TL;DR

OVTR tackles open-vocabulary multiple object tracking by learning a continuous, end-to-end framework that propagates category information across frames. It introduces a Category Information Propagation (CIP) mechanism and a dual-branch decoder to fuse CLIP-aligned image and text features, along with decoder protection to stabilize classification and tracking. The method achieves state-of-the-art open-vocabulary performance on TAO with faster inference and reduced preprocessing, and demonstrates strong cross-dataset transfer to KITTI. The contributions include a novel end-to-end architecture, CIP strategy, protective decoder design, and alignment-based multimodal fusion.

Abstract

Open-vocabulary multiple object tracking aims to generalize trackers to categories unseen during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker is constrained by its framework structure, isolated frame-level perception, and insufficient modal interactions, which hinder its performance in open-vocabulary classification and tracking. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we design the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Experimental results show that our method surpasses previous trackers on the open-vocabulary MOT benchmark while also achieving faster inference and significantly reducing preprocessing requirements. Moreover, an experiment transferring the model to another dataset demonstrates its strong adaptability. Models and code are released at https://github.com/jinyanglii/OVTR.

Paper Structure

This paper contains 29 sections, 6 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparison of tracking-by-OVD and our method. Tracking-by-OVD predicts each frame independently, making classification and association susceptible to changes in appearance. In contrast, our method, OVTR, propagates location, appearance, and category information from the current frame to subsequent frames, creating a stable, continuously updated information flow. This flow serves as a prior, aiding in capturing the corresponding target in future frames.
  • Figure 2: Overview of OVTR. OVTR processes two modalities of input, with modality interaction structures in both the encoder and decoder. The dual-branch decoder has the OFA branch, which serves as the medium through which the CLIP image encoder guides our model to achieve visual generalization capabilities, and the CTI branch, which handles open-vocabulary interaction and classification. The updated outputs are used as prior queries for the next frame's predictions.
  • Figure 3: Architectures of the dual-branch decoder and the encoder. After modality fusion in the encoder, the resulting image and text features are separately fed into the decoder's Image Cross-Attention and Text Cross-Attention for interactions. Aligned queries are processed by the OFA and CTI branches to generate bounding boxes $B$, alignment features $F_\text{align}$, and branch outputs $O_\text{txt}$.
  • Figure 4: Attention isolation masks. In the difference matrix, the darker areas indicate a smaller KL divergence, meaning the category prediction distributions of the corresponding queries in the current layer are more similar. This suggests that the category information these queries pass to the next layer is also similar. The darker areas of the masks represent masked positions, while the red dashed box shows that interactions among track queries are maintained.
  • Figure 5: OVTR data augmentations. Unlike OVTrack, which is based on appearance matching, our method does not utilize diffusion models for data augmentation. Instead, we propose Dynamic Mosaic and Random Occlusion data augmentation to simulate three situations: object appearance and disappearance, tracking continuity after occlusion, and correct association when relative motion occurs between tracked objects and others.
  • ...and 5 more figures
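
The difference matrix and isolation mask described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of a symmetric KL divergence, the fixed `threshold`, the choice to mask pairs whose divergence exceeds it, and the convention that track queries occupy the first `num_track` positions are all assumptions made for illustration. The caption only states that the matrix holds KL divergences between per-query category distributions and that interactions among track queries are kept.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def difference_matrix(probs):
    """Pairwise symmetric KL divergence between per-query category distributions.

    Symmetrizing the divergence is an assumption of this sketch.
    """
    n = len(probs)
    diff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (kl_divergence(probs[i], probs[j])
                       + kl_divergence(probs[j], probs[i]))
            diff[i][j] = diff[j][i] = d
    return diff


def isolation_mask(probs, num_track, threshold=0.5):
    """Boolean self-attention mask; True means the pair may NOT attend.

    Assumptions (hypothetical, not from the paper):
      - the first `num_track` queries are track queries,
      - a pair is masked when its divergence exceeds `threshold`,
      - track-to-track interactions are always kept.
    """
    diff = difference_matrix(probs)
    n = len(probs)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a query always attends to itself
            if i < num_track and j < num_track:
                continue  # keep interactions among track queries
            mask[i][j] = diff[i][j] > threshold
    return mask


# Toy example: 2 track queries followed by 2 detect queries, 3 categories.
probs = [
    [0.90, 0.05, 0.05],  # track query 0
    [0.05, 0.90, 0.05],  # track query 1
    [0.88, 0.07, 0.05],  # detect query, similar to track query 0
    [0.05, 0.05, 0.90],  # detect query, unlike the others
]
mask = isolation_mask(probs, num_track=2)
for row in mask:
    print(["x" if m else "." for m in row])
```

In this toy input, the two track queries stay connected even though their predicted categories differ, the detect query that agrees with track query 0 remains visible to it, and the remaining dissimilar pairs are isolated from one another.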