Human Re-ID Meets LVLMs: What can we expect?

Kailash Hambarde, Pranita Samale, Hugo Proença

TL;DR

The paper investigates the applicability of large vision-language models (LVLMs) to human re-identification (ReID) by benchmarking ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max against a specialized baseline (PersonViT) on Market1501, using a pipeline of dataset curation, prompt engineering, and multi-metric evaluation. Because LVLMs often assign identical similarity scores to many gallery images, traditional ReID metrics such as rank-1 accuracy and mAP become unreliable, so the authors instead rely on genuine/impostor score distributions, the decidability index $d'$, classification metrics, and ROC-AUC. The findings show that LVLMs offer some interpretability and flexibility but exhibit limited discriminative power, especially in batch settings, and that some models initially refuse ReID tasks over privacy concerns. The study suggests future work on integrated architectures that fuse LVLMs with specialized ReID methods to leverage their complementary strengths and mitigate catastrophic failures in surveillance contexts.
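The decidability index mentioned above can be illustrated with a minimal sketch. It assumes the commonly used definition $d' = |\mu_g - \mu_i| / \sqrt{(\sigma_g^2 + \sigma_i^2)/2}$, where $\mu$ and $\sigma^2$ are the mean and variance of the genuine ($g$) and impostor ($i$) score distributions; this summary does not state the paper's exact formulation. The function name and the sample scores are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pvariance


def decidability_index(genuine, impostor):
    """Separation between genuine and impostor similarity scores.

    Assumes d' = |mu_g - mu_i| / sqrt((var_g + var_i) / 2), computed with
    population variances. Larger values mean better separation.
    """
    mu_g, mu_i = mean(genuine), mean(impostor)
    pooled_std = math.sqrt((pvariance(genuine) + pvariance(impostor)) / 2.0)
    if pooled_std == 0.0:
        # Both distributions are constant: no separation if the means are
        # equal, perfect separation otherwise.
        return 0.0 if mu_g == mu_i else math.inf
    return abs(mu_g - mu_i) / pooled_std


# Hypothetical scores: a model that separates the two classes well...
well_separated = decidability_index([0.90, 0.80, 0.85, 0.95],
                                    [0.20, 0.30, 0.25, 0.15])
# ...and one that returns the same score for every pair.
all_tied = decidability_index([0.8, 0.8, 0.8, 0.8], [0.8, 0.8, 0.8, 0.8])
```

Under this definition, a model that gives every pair the same score has $d' = 0$, which is one way to quantify the tied-score behaviour even when ranking metrics break down.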

Abstract

Large vision-language models (LVLMs) have been regarded as a breakthrough advance in an astounding variety of tasks, from content generation to virtual assistants and multimodal search or retrieval. However, for many of these applications, the performance of these methods has been widely criticized, particularly when compared with state-of-the-art methods and technologies in each specific domain. In this work, we compare the performance of the leading large vision-language models in the human re-identification task, using as baseline the performance attained by state-of-the-art AI models specifically designed for this problem. We compare the results of ChatGPT-4o, Gemini-2.0-Flash, Claude 3.5 Sonnet, and Qwen-VL-Max to a baseline ReID PersonViT model, using the well-known Market1501 dataset. Our evaluation pipeline includes dataset curation, prompt engineering, and metric selection to assess the models' performance. Results are analyzed from several perspectives: similarity scores, classification accuracy, and classification metrics, including precision, recall, F1 score, and area under the curve (AUC). Our results confirm the strengths of LVLMs, but also their severe limitations, which often lead to catastrophic answers and should be the focus of further research. As a concluding remark, we speculate about further research that should fuse traditional methods and LVLMs to combine the strengths of both families of techniques and achieve solid improvements in performance.

Paper Structure

This paper contains 16 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Experimental workflow for evaluating LVLMs on ReID tasks.
  • Figure 2: An illustrative representation of the quality indices for the selected LVLMs. The chart highlights the comparative strengths of the models across key benchmarks.
  • Figure 3: Comparison of similarity score distributions for a single query image (Q) retrieved using PersonViT (left) and the LVLM Qwen-VL-Max (right). PersonViT assigns distinct similarity scores, enabling effective ranking of gallery images, as shown by the score distribution. Conversely, Qwen-VL-Max assigns nearly identical scores to most gallery images, as illustrated in the histogram. This lack of differentiation in scores complicates the calculation of ReID metrics such as rank-1 accuracy and mAP.
  • Figure 4: Comparison of similarity score distributions for PersonViT (top left) and four LVLMs (columns), shown in both pairwise (top row) and batch (bottom row) settings. Each histogram depicts genuine pairs (green) and impostor pairs (red) with superimposed density curves. The orange marker in each plot indicates the decidability index, which reflects how effectively genuine pairs are separated from impostors. PersonViT, trained specifically for ReID, displays a clear separation between the two distributions. In contrast, the LVLMs exhibit varying degrees of overlap, with some showing more pronounced difficulty when moving from pairwise to batch evaluation.
  • Figure 5: ROC curves for pairwise and batch evaluations. Each plot shows the true positive rate (TPR) versus the false positive rate (FPR) for each model, with the dashed diagonal indicating random-chance performance. PersonViT (yellow) demonstrates the largest AUC, reflecting its specialized ReID design. The LVLMs exhibit varying performance levels, and certain models show notable changes when transitioning from pairwise to batch mode.
  • ...and 3 more figures
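
The Figure 3 caption notes that near-identical scores complicate rank-based metrics. The sketch below illustrates why, under stated assumptions: when non-matching gallery images tie with the correct match, the position of that match depends entirely on how ties are broken, so rank-1 accuracy is not well defined. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def rank_bounds(scores, is_match):
    """Best-case and worst-case rank of the first correct gallery match.

    scores:   similarity of each gallery image to the query (higher = closer)
    is_match: True where the gallery image shows the query identity
    Returns (optimistic_rank, pessimistic_rank), both 1-based.
    """
    pairs = list(zip(scores, is_match))
    best_match = max(s for s, m in pairs if m)
    # Every image scoring strictly above the best match is a non-match.
    strictly_higher = sum(1 for s, _ in pairs if s > best_match)
    # Non-matches tied with the best match may land before or after it.
    tied_non_matches = sum(1 for s, m in pairs if s == best_match and not m)
    return strictly_higher + 1, strictly_higher + tied_non_matches + 1


# Hypothetical gallery of four images; exactly one is the true match.
distinct = rank_bounds([0.91, 0.40, 0.35, 0.10], [True, False, False, False])
tied = rank_bounds([0.80, 0.80, 0.80, 0.80], [False, False, True, False])
```

With distinct scores both bounds coincide, so the rank is unambiguous. With fully tied scores the same query can be counted as a rank-1 hit or a rank-4 miss depending on tie-breaking, which is consistent with the authors' choice to report distribution-based and classification metrics instead.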