Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge

Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

TL;DR

This paper tackles MOT25-StAG, a benchmark that requires tracking multiple objects in video while grounding them to free-form language queries. It introduces a zero-shot, two-stage pipeline that first generates object tracks with FastTracker and then captions each track with LLaVA-Video, retrieving tracks by the cosine similarity between query and caption. The approach scores m-HIoU 20.68 and HOTA 10.73 and ranks second in the challenge, showing that a training-free combination of strong vision and language models can serve as a baseline for spatiotemporal action grounding. The study highlights retrieval robustness as a strength and discusses failure modes related to captioning misalignment and hallucination, suggesting pathways to improve grounding through model tuning.

Abstract

In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific, free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach that combines the advantages of the state-of-the-art tracking model FastTracker and the multimodal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.

Paper Structure

This paper contains 8 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Our two-stage, training-free framework for spatiotemporal action grounding. In the first stage, we track all objects with the pre-trained tracking model FastTracker and generate the tracking results as one video per track. In the second stage, we caption each per-track video with LLaVA-Video. Videos are retrieved based on the similarity between the caption and the query.
  • Figure 2: A sample video caption generated by LLaVA-Video, instructed to focus on the action of the person tracked and highlighted in a bounding box.
  • Figure 3: A failure case of video captioning. The person's precise action of moving towards the camera is not captured, and there is a hallucination that "the person is captured in a single frame".
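
The retrieval step described for Figure 1 (matching each track's caption against the query by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the text encoder used in the paper is not stated here, so a toy bag-of-words embedding stands in for it, and the names `embed`, `cosine`, `retrieve`, the `threshold` value, and the example captions are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; an assumed stand-in for a real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, captions: dict[int, str], threshold: float = 0.5):
    """Return (track_id, score) pairs whose caption matches the query, best first.

    `threshold` is a hypothetical cut-off, not a value taken from the paper.
    """
    q = embed(query)
    scored = [(tid, cosine(q, embed(cap))) for tid, cap in captions.items()]
    kept = [(tid, score) for tid, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical captions, one per track (keyed by track id).
captions = {
    1: "a person walking towards the camera",
    2: "a person riding a bicycle on the street",
    3: "a car parked near the building",
}
matches = retrieve("person riding a bicycle", captions)
print(matches)
```

With these example captions, only track 2 clears the threshold. In a real pipeline the toy `embed` would be replaced by a learned sentence encoder, since word overlap alone cannot match paraphrases such as "cycling" and "riding a bicycle".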