ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao

TL;DR

ReaMOT addresses reasoning-based multi-object tracking by introducing a benchmark (ReaMOT Challenge) that pairs complex language instructions with video scenes drawn from 12 datasets. It also proposes ReaTrack, a training-free baseline that combines large vision-language models (LVLM) with SAM2 to reason about targets and track them online, evaluated under zero-shot conditions. The benchmark comprises 1,156 reasoning-rich instructions, 423,359 image-language pairs, and 869 scenes, split into Easy, Medium, and Hard difficulty levels and scored with four tailored metrics (RIDF1, RMOTA, RRcll, RPrcn). Experimental results show ReaTrack achieving state-of-the-art performance across all difficulty levels and metrics, indicating strong zero-shot generalization and robustness in reasoning-driven tracking. The work provides a practical baseline and a comprehensive dataset for reasoning-enabled tracking research, while noting limitations in dataset analysis depth and real-time applicability.

Abstract

Referring Multi-Object Tracking (RMOT) is an important research field in computer vision. The task is to guide models to track the objects that match a language instruction. However, RMOT commonly requires clear language instructions, and such methods often fail when given complex language instructions with reasoning characteristics. In this work, we propose a new task, called Reasoning-based Multi-Object Tracking (ReaMOT). ReaMOT is a more challenging task that requires accurately reasoning about which objects match a language instruction with reasoning characteristics and tracking those objects' trajectories. To advance the ReaMOT task and evaluate the reasoning capabilities of tracking models, we construct ReaMOT Challenge, a reasoning-based multi-object tracking benchmark built upon 12 datasets. Specifically, it comprises 1,156 language instructions with reasoning characteristics, 423,359 image-language pairs, and 869 diverse scenes, divided into three levels of reasoning difficulty. In addition, we propose a set of evaluation metrics tailored for the ReaMOT task. Furthermore, we propose ReaTrack, a training-free framework for reasoning-based multi-object tracking based on large vision-language models (LVLM) and SAM2, as a baseline for the ReaMOT task. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.
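
The TL;DR names four metrics (RIDF1, RMOTA, RRcll, RPrcn) without defining them in this section. As rough orientation only, the sketch below computes the standard multi-object tracking quantities that those names appear to echo (IDF1, MOTA, recall, precision) from per-sequence counts. The `TrackingCounts` container, its field names, and the example numbers are hypothetical. How the paper adapts or aggregates these quantities for reasoning-based instructions is not stated here, so this should not be read as the paper's own metric definitions.

```python
from dataclasses import dataclass


@dataclass
class TrackingCounts:
    """Hypothetical per-sequence counts; field names are illustrative only."""
    num_gt: int       # ground-truth boxes of the referred objects
    tp: int           # predictions matched to a ground-truth box
    fp: int           # predictions with no matching ground-truth box
    fn: int           # ground-truth boxes with no matching prediction
    id_switches: int  # identity switches along matched trajectories
    idtp: int         # identity-level true positives
    idfp: int         # identity-level false positives
    idfn: int         # identity-level false negatives


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def mota(c: TrackingCounts) -> float:
    # Standard CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT.
    # Left undefined (returned as 0.0 here) when there is no ground truth.
    if c.num_gt == 0:
        return 0.0
    return 1.0 - (c.fn + c.fp + c.id_switches) / c.num_gt


def idf1(c: TrackingCounts) -> float:
    # Standard identity F1: 2*IDTP / (2*IDTP + IDFP + IDFN).
    return _safe_div(2 * c.idtp, 2 * c.idtp + c.idfp + c.idfn)


def recall(c: TrackingCounts) -> float:
    # Fraction of ground-truth boxes that were detected.
    return _safe_div(c.tp, c.tp + c.fn)


def precision(c: TrackingCounts) -> float:
    # Fraction of predicted boxes that were correct.
    return _safe_div(c.tp, c.tp + c.fp)


# Made-up counts for one instruction/sequence pair.
example = TrackingCounts(num_gt=100, tp=80, fp=10, fn=20,
                         id_switches=2, idtp=70, idfp=20, idfn=30)
scores = {
    "IDF1": idf1(example),
    "MOTA": mota(example),
    "Recall": recall(example),
    "Precision": precision(example),
}
```

In a referring or reasoning setting, a natural choice is to compute such counts only over the objects that an instruction refers to, so that tracking a non-referred object counts as a false positive. That choice is an assumption of this sketch rather than something this section confirms.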

Paper Structure

This paper contains 25 sections, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: The difference between RMOT and ReaMOT tasks. Unlike RMOT, the ReaMOT task requires the models to engage in a deep reasoning process for deducing and tracking the targets.
  • Figure 2: Dataset annotation process. The annotation process includes three steps: (1) Manual Pre-selection; (2) GPT-assisted Annotation; (3) Manual Annotation and Re-checking. First, we draw object bounding boxes throughout the video, select objects with common characteristics, and extract the key frames containing these objects. Then, we input the key frames into GPT and instruct it to analyze the appearance, movement, and relationships of the specified targets, as well as their differences from other objects. Finally, we extract and summarize the features output by GPT, obtain keywords, and have them reviewed by multiple people to generate the final language instructions.
  • Figure 3: Frame-count distribution of language instructions. The number of language instructions, and of frames corresponding to those instructions, at the Easy, Medium, and Hard levels in the ReaMOT Challenge dataset.
  • Figure 4: Object count distribution and category distribution. (a) The proportion and number of language instructions by the number of objects each instruction involves in the ReaMOT Challenge dataset; (b) The proportion and number of language instructions by object category in the ReaMOT Challenge dataset.
  • Figure 5: Word cloud. The word cloud of the ReaMOT Challenge dataset contains a large number of words describing categories, motion features, and orientations.
  • ...and 6 more figures