Find Them All: Unveiling MLLMs for Versatile Person Re-identification

Jinhao Li, Zijian Chen, Lirong Deng, Guangtao Zhai, Changbo Wang

TL;DR

The paper addresses the challenge of cross-modal person re-identification by evaluating how multi-modal large language models can directly perform retrieval across ten heterogeneous modalities. It introduces the VP-ReID benchmark, which comprises 257,310 multi-modal queries and gallery images and two evaluation schemes, MCQ and QGM, to probe cross-modal reasoning and retrieval. Through extensive experiments with both proprietary and open-source MLLMs alongside traditional ReID baselines, the study reveals strong cross-modal capabilities in several modalities but persistent gaps in thermal and infrared data, highlighting modality-specific limitations. The work also analyzes scaling behavior, interpretability, and practical acceleration via vLLM, offering a roadmap for developing robust cross-modal foundation models for versatile person ReID in real-world settings.

Abstract

Person re-identification (ReID) aims to retrieve images of a target person from the gallery set, with wide applications in medical rehabilitation and public security. However, traditional person ReID models are typically uni-modal, resulting in limited generalizability across heterogeneous data modalities. Recently, the emergence of multi-modal large language models (MLLMs) has shown a promising avenue for addressing this issue. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, leaving their capabilities in person ReID tasks largely unexplored. To bridge this gap, we introduce a novel benchmark for Versatile Person Re-IDentification, termed VP-ReID. The benchmark includes 257,310 multi-modal queries and gallery images, covering ten diverse person ReID tasks. In addition, we propose two task-oriented evaluation schemes for MLLM-based person ReID. Extensive experiments demonstrate the impressive versatility, effectiveness, and interpretability of MLLMs in various person ReID tasks. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope that VP-ReID can facilitate the community in developing more robust and generalizable cross-modal foundation models for person ReID.

Paper Structure

This paper contains 18 sections, 4 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison of different person re-identification (ReID) paradigms. Traditional methods typically match a query, e.g., a suspect portrait or textual description, against a gallery by comparing high-level features. Similarly, existing MLLM-based methods utilize MLLMs to refine an initial description to gradually align with the target. However, both conventional paradigms predominantly concentrate on a few modalities (e.g., RGB and text), limiting their applicability in diverse real-world scenarios. In this work, we conduct a holistic evaluation of MLLMs on ten different modalities to verify their capabilities in handling person ReID tasks directly.
  • Figure 2: Construction pipeline of VP-ReID. We first collect source contents from ten person ReID datasets covering diverse modalities. Then, queries and gallery images are sampled based on the two proposed evaluation schemes, MCQ and QGM, resulting in a total of 257,310 queries and gallery images. Finally, we organize the data into question–answer pairs for the MCQ and extract labels for the QGM.
  • Figure 3: Performance (CMC) comparisons between traditional models and MLLMs on ten person ReID tasks under the QGM setting.
  • Figure 4: Scaling behavior of different model families on our VP-ReID. We report their averaged Rank@1 and mAP on eight tasks.
  • Figure 5: Correlation across different tasks in VP-ReID.
  • ...and 5 more figures
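
The figure captions above report CMC, Rank@1, and mAP. As a rough illustration of how these retrieval metrics are conventionally defined, the sketch below assumes that each query yields a list of gallery identity labels sorted from most to least similar. The function names `cmc_and_ap` and `evaluate` are hypothetical, and the paper's exact protocol (for example, how MLLM outputs are turned into a ranking under QGM) is not stated in this summary, so this should be read as the textbook definition rather than the authors' implementation.

```python
from typing import List, Sequence, Tuple


def cmc_and_ap(ranked_gallery_ids: Sequence[int], query_id: int,
               max_rank: int = 10) -> Tuple[List[int], float]:
    """Per-query CMC curve and average precision.

    ranked_gallery_ids: gallery identity labels, most similar first (assumed input).
    query_id: identity label of the query.
    """
    matches = [1 if gid == query_id else 0 for gid in ranked_gallery_ids]

    # CMC: entry k-1 is 1 if a correct match appears within the top k results.
    cmc, hit = [], 0
    for k in range(max_rank):
        if k < len(matches) and matches[k]:
            hit = 1
        cmc.append(hit)

    # AP: mean of the precision values measured at each correct match.
    num_hits, precisions = 0, []
    for position, is_match in enumerate(matches, start=1):
        if is_match:
            num_hits += 1
            precisions.append(num_hits / position)
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return cmc, ap


def evaluate(rankings: Sequence[Sequence[int]], query_ids: Sequence[int],
             max_rank: int = 10) -> Tuple[List[float], float]:
    """Average the per-query CMC curves and APs; returns (cmc_curve, mAP)."""
    per_query = [cmc_and_ap(r, q, max_rank) for r, q in zip(rankings, query_ids)]
    n = len(per_query)
    cmc_curve = [sum(c[k] for c, _ in per_query) / n for k in range(max_rank)]
    mean_ap = sum(ap for _, ap in per_query) / n
    return cmc_curve, mean_ap


if __name__ == "__main__":
    # Two toy queries with made-up identity labels.
    rankings = [[3, 1, 1, 2], [2, 5, 7]]
    query_ids = [1, 2]
    cmc_curve, mean_ap = evaluate(rankings, query_ids, max_rank=3)
    print("Rank@1:", cmc_curve[0], "mAP:", round(mean_ap, 4))
```

Rank@1 is the first entry of the averaged CMC curve, so a single pass over the ranked lists is enough to produce both numbers reported in the scaling figure.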