LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification
Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, Xi Peng
TL;DR
This work formalizes Interactive Person Re-Identification (Inter-ReID) as a dialogue-driven retrieval problem in which an initial witness description is progressively refined through targeted questions. It introduces Interactive-PEDES, a multi-round dialogue dataset built via coarse-to-fine captioning, sub-caption decomposition, and diverse Q&A generation to capture fine-grained attributes. The proposed LLaVA-ReID framework uses a selective multi-image Questioner with a hard-pass visual-context selector and a look-forward supervision strategy to maximize information gain per round, integrated with a CLIP-based Retriever and an LLM-based Answerer. Empirical results show strong gains on Inter-ReID and beneficial transfer to text-based ReID benchmarks, with ablations validating the value of selective context and look-forward supervision for efficient dialogue-driven refinement.
Abstract
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
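The looking-forward idea can be illustrated with a minimal sketch: for each candidate question, simulate the answer it would elicit, update the query, and keep the question that ranks the target person highest. Everything below is an assumption made for illustration, not the paper's implementation: the toy 3-dimensional embeddings, the additive query update, the use of target rank as the informativeness score, and the names `target_rank` and `look_forward_select`.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def target_rank(query: Vector, gallery: List[Vector], target_idx: int) -> int:
    """1-based rank of the target image when the gallery is sorted by similarity."""
    target_score = dot(query, gallery[target_idx])
    better = sum(
        1
        for i, g in enumerate(gallery)
        if i != target_idx and dot(query, g) > target_score
    )
    return 1 + better


def look_forward_select(
    query: Vector,
    candidates: Dict[str, Vector],
    gallery: List[Vector],
    target_idx: int,
) -> Tuple[Optional[str], Optional[int]]:
    """Pick the candidate question whose simulated answer ranks the target best.

    `candidates` maps each question to a hypothetical embedding of the answer
    it would elicit; the additive update below is an assumed stand-in for
    re-encoding the refined description.
    """
    best_question: Optional[str] = None
    best_rank: Optional[int] = None
    for question, answer_vec in candidates.items():
        updated = [q + a for q, a in zip(query, answer_vec)]
        rank = target_rank(updated, gallery, target_idx)
        if best_rank is None or rank < best_rank:
            best_question, best_rank = question, rank
    return best_question, best_rank


# Toy gallery of three people; index 2 is the target.
gallery = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.0, 0.6]]
query = [1.0, 0.0, 0.0]  # vague initial description
candidates = {
    "What color is the jacket?": [0.0, 1.0, 0.0],
    "Is the person carrying a bag?": [0.0, 0.0, 1.0],
}

initial_rank = target_rank(query, gallery, target_idx=2)
best_question, best_rank = look_forward_select(query, candidates, gallery, target_idx=2)
print(initial_rank, best_question, best_rank)
```

In this toy setup, the selected question would serve as the supervision target for that dialogue round, since it is the one that most improves retrieval of the target person.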
