Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
TL;DR
This paper tackles MOT25-StAG, a benchmark that requires tracking multiple objects in video while grounding them to free-form language queries. It introduces a zero-shot, two-stage pipeline: FastTracker first generates object tracks, then LLaVA-Video captions each track, and tracks are retrieved by the cosine similarity between the query and each caption. The approach scores m-HIoU 20.68 and HOTA 10.73 on the test set and ranks second in the challenge, showing that a training-free combination of strong vision and language models can serve as a robust baseline for spatiotemporal action grounding. The study highlights the robustness of the retrieval step, discusses failure modes caused by captioning misalignment and hallucination, and suggests model tuning as a path to better grounding.
Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The challenge requires accurately localizing and tracking multiple objects that match specific, free-form language queries, given videos of complex real-world scenes as input. We model the task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the state-of-the-art tracking model FastTracker and the multimodal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves an m-HIoU of 20.68 and a HOTA of 10.73, placing second in the challenge.
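The retrieval stage, which matches a language query against per-track captions by cosine similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for whatever text encoder the method uses, the captions are invented examples rather than LLaVA-Video output, and the `threshold` parameter and the names `cosine` and `retrieve` are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a stand-in for a real text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, track_captions, threshold=0.8):
    """Return IDs of tracks whose caption is similar enough to the query.

    track_captions maps a track ID (from the tracker) to its caption
    (from the captioning model). The threshold is a hypothetical parameter.
    """
    q = embed(query)
    scores = {tid: cosine(q, embed(cap)) for tid, cap in track_captions.items()}
    matched = [tid for tid, s in scores.items() if s >= threshold]
    matched.sort(key=lambda tid: scores[tid], reverse=True)
    return matched, scores


# Invented captions for three tracks, for illustration only.
captions = {
    1: "a person riding a bicycle",
    2: "a car parked on the street",
    3: "a person walking a dog",
}
matched, scores = retrieve("person riding a bicycle", captions)
print(matched)
```

Because a query may describe several objects at once, the sketch returns every track above the threshold rather than only the single best match. In practice a learned sentence encoder would replace `embed`, and the threshold would need to be chosen on held-out data.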
