Find Them All: Unveiling MLLMs for Versatile Person Re-identification
Jinhao Li, Zijian Chen, Lirong Deng, Guangtao Zhai, Changbo Wang
TL;DR
The paper addresses cross-modal person re-identification (ReID) by evaluating how well multi-modal large language models (MLLMs) can directly perform retrieval across ten diverse person ReID tasks. It introduces the VP-ReID benchmark, which comprises 257,310 multi-modal queries and gallery images, together with two task-oriented evaluation schemes, MCQ and QGM, to probe cross-modal reasoning and retrieval. Experiments with proprietary and open-source MLLMs alongside traditional ReID baselines show strong versatility across most tasks but persistent gaps on a few modalities, particularly thermal and infrared data. The work also analyzes scaling behavior, interpretability, and practical acceleration via vLLM, offering a roadmap toward robust cross-modal foundation models for versatile person ReID in real-world settings.
Abstract
Person re-identification (ReID) aims to retrieve images of a target person from a gallery set, with wide applications in medical rehabilitation and public security. However, traditional person ReID models are typically uni-modal, which limits their generalizability across heterogeneous data modalities. The recent emergence of multi-modal large language models (MLLMs) offers a promising avenue for addressing this issue. Despite this potential, existing methods merely treat MLLMs as feature extractors or caption generators, leaving their capabilities in person ReID tasks largely unexplored. To bridge this gap, we introduce a novel benchmark for Versatile Person Re-IDentification, termed VP-ReID. The benchmark includes 257,310 multi-modal queries and gallery images, covering ten diverse person ReID tasks. In addition, we propose two task-oriented evaluation schemes for MLLM-based person ReID. Extensive experiments demonstrate the versatility, effectiveness, and interpretability of MLLMs across various person ReID tasks. Nevertheless, MLLMs remain limited in handling a few modalities, particularly thermal and infrared data. We hope that VP-ReID can help the community develop more robust and generalizable cross-modal foundation models for person ReID.
