LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng

TL;DR

This work formalizes Interactive Person Re-Identification (Inter-ReID) as a dialogue-driven retrieval problem where initial witness descriptions are progressively refined through targeted questions. It introduces Interactive-PEDES, a multi-round dialogue dataset built via coarse-to-fine captioning, sub-caption decomposition, and diverse Q&A generation to capture fine-grained attributes. The proposed LLaVA-ReID framework uses a selective multi-image Questioner with a hard-pass visual-context selector and a look-forward supervision strategy to maximize information gain per round, integrated with a CLIP-based Retriever and an LLM-based Answerer. Empirical results show strong gains on Inter-ReID and beneficial transfer to text-based ReID benchmarks, with ablations validating the value of selective context and look-forward supervision for efficient dialogue-driven refinement.

Abstract

Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
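
The iterative refinement described in the abstract can be pictured as a retrieve, ask, answer loop. The following is a minimal Python sketch under loud assumptions, not the authors' implementation: the CLIP-based Retriever is replaced by a toy bag-of-words cosine ranker, gallery images are stood in for by short captions, and the Questioner and witness are hypothetical stub functions (`ask`, `answer`). All names here are illustrative.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy stand-in for an encoder: bag-of-words counts (assumption, not CLIP)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(description, gallery):
    """Return gallery ids sorted from most to least similar to the description."""
    q = embed(description)
    return sorted(gallery, key=lambda pid: cosine(q, embed(gallery[pid])), reverse=True)

def interactive_retrieval(initial, gallery, ask, answer, rounds=3, top_k=2):
    """Refine the description over several rounds, then return the final ranking."""
    description = initial
    for _ in range(rounds):
        candidates = rank(description, gallery)[:top_k]
        question = ask(description, candidates)  # hypothetical Questioner
        if question is None:                     # nothing useful left to ask
            break
        description = f"{description} {answer(question)}"  # witness reply
    return rank(description, gallery)

# Toy gallery: captions stand in for person images (assumption).
gallery = {
    "p1": "man black jacket blue jeans white sneakers",
    "p2": "man black jacket blue jeans red backpack",
    "p3": "woman green dress",
}

def ask(description, candidates):
    # Hypothetical rule-based questioner: ask about carried items only once.
    return None if "backpack" in description else "Is the person carrying anything?"

def answer(question):
    # Hypothetical witness who remembers one extra detail.
    return "red backpack"

result = interactive_retrieval("man black jacket", gallery, ask, answer)
print(result)
```

In this toy run the initial description cannot separate `p1` from `p2`, and the single follow-up answer is what moves `p2` to the top. The paper's Questioner instead generates questions from the description and selected candidate images.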

Paper Structure

This paper contains 30 sections, 10 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: An illustrated example of interactive person re-identification. The red circles highlight the distinctive details in the candidate images that the inquiry process needs to focus on.
  • Figure 2: Illustration of our automated dialogue data construction pipeline. Step 1: Generate coarse and fine-grained descriptions. Step 2: Decompose follow-up descriptions into distinct attributes. Step 3: Formulate diverse Q&A pairs.
  • Figure 3: (Left) The framework of interactive person re-identification. The Retriever encodes gallery images and the description, providing retrieval results and the relevant candidates to the Questioner. The Questioner generates discriminative questions based on the description and the candidates. The Witness provides the corresponding information in response to these questions. (Right) The architecture of the selector. The selector chooses the most representative candidates from the top-$k$ retrieved persons based on textual information.
  • Figure 4: Retrieval performance vs. number of queries. The solid line denotes R@1 and the dashed line denotes R@5.
  • Figure 5: Distribution of description length (Left) and the dialogue round (Right) in our dataset.
  • ...and 3 more figures