VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

Zekun Qian, Ruize Han, Junhui Hou, Linqi Song, Wei Feng

TL;DR

This paper considers the tracking-related state of the objects during tracking and proposes a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects.

Abstract

Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results show that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison between the prior method OVTrack (li2023ovtrack) and our method.
  • Figure 2: The training framework consists of three parts. The first is the localization head, which localizes objects of all categories in the video as region candidates. The second is the CLIP-distilled classification head, consisting of image and text branches, which uses tracking-state-aware prompts to guide the model to focus on object states while learning classification features, thereby better distinguishing the OV categories. The third is the association head, which uses intra/inter-consistency between the same objects in different frames to learn association features in a self-supervised way.
  • Figure 3: Illustration of regions with high (a) or low (b) prompt-guided attention, respectively.
  • Figure 4: Comparison of OVMOT results between our method and OVTrack on cases with novel classes.
  • Figure 5: Failure case illustration.
  • ...and 1 more figure