Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci

TL;DR

This work tackles VLM personalization without training by introducing R2P, a training-free Retrieval and Reasoning framework that builds a personal multimodal database enriched with fingerprint attributes derived from the VLM. At inference, R2P performs multimodal retrieval to shortlist candidates, then uses attribute-focused Chain-of-Thought reasoning and cross-modal verification to identify the best match, with optional pairwise reasoning for difficult cases. A new PerVA dataset is proposed to stress test personalization under visual ambiguity, alongside extensive experiments showing R2P achieving state-of-the-art performance on MyVLM, Yo’LLaVA, and PerVA across recognition, captioning, and personalized VQA tasks. The approach reduces reliance on costly retraining while robustly handling visually similar personal concepts, highlighting practical potential for user-specific VLM customization. Future work will target efficiency improvements and handling cluttered scenes with near-identity objects.

Abstract

Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be either costly or unpleasant for individual users. We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), which leverages the internal knowledge of VLMs. First, we use VLMs to extract the concept fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and on a newly introduced dataset for concept identification, Personal Concepts with Visual Ambiguity (PerVA), which highlights the challenges of visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
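
The control flow described in the abstract (fingerprint extraction, retrieval, scoring in two modalities, and a pairwise fallback on disagreement) can be sketched as follows. This is a minimal illustration and not the authors' implementation: every name here (`Concept`, `build_database`, `retrieve`, `identify`, and the callables `extract_fingerprint`, `embed`, `text_score`, `visual_score`, `pairwise_match`) is hypothetical, the VLM calls are stood in for by plain Python callables, and a "discrepancy" is assumed to mean that the text-side and image-side scores pick different top candidates.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence


@dataclass
class Concept:
    name: str                    # user-given concept name
    fingerprint: Dict[str, str]  # attribute -> value (assumed format)
    embedding: List[float]       # vector used for retrieval


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_database(
    references: Dict[str, Any],
    extract_fingerprint: Callable[[Any], Dict[str, str]],
    embed: Callable[[Any], List[float]],
) -> List[Concept]:
    """Database creation: one entry per personal concept.

    `extract_fingerprint` stands in for prompting the VLM for attributes.
    """
    return [
        Concept(name, extract_fingerprint(image), embed(image))
        for name, image in references.items()
    ]


def retrieve(query_emb: Sequence[float], database: List[Concept], k: int) -> List[Concept]:
    """Shortlist the k concepts most similar to the query embedding."""
    ranked = sorted(database, key=lambda c: cosine(query_emb, c.embedding), reverse=True)
    return ranked[:k]


def identify(
    query_emb: Sequence[float],
    database: List[Concept],
    text_score: Callable[[Concept], float],
    visual_score: Callable[[Concept], float],
    pairwise_match: Callable[[List[Concept]], Concept],
    k: int = 3,
) -> str:
    """Return the name of the best-matching concept for a query."""
    candidates = retrieve(query_emb, database, k)
    if not candidates:
        raise ValueError("personal database is empty")
    # Score every shortlisted candidate in each modality.
    best_text = max(candidates, key=text_score)
    best_visual = max(candidates, key=visual_score)
    # Assumed verification rule: accept only if both modalities agree.
    if best_text.name == best_visual.name:
        return best_text.name
    # Otherwise fall back to direct pairwise comparison with the query.
    return pairwise_match(candidates).name
```

In this sketch the expensive pairwise step runs only when the two scores disagree, which mirrors the abstract's description of pairwise matching as a refinement rather than the default path.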

Paper Structure

This paper contains 24 sections, 8 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Current VLM personalization methods depend on expensive training procedures. In contrast, we introduce R2P, a training-free approach that utilizes textual attributes as unique fingerprints for identifying personal concepts.
  • Figure 2: R2P is the first training-free method to address VLM personalization, aiming to recognize the personal concept from a query image. R2P consists of two phases. First, in the personal database $\mathcal{D}$ creation phase, we leverage the VLM $\Phi_\mathtt{VLM}$ to enrich personal concepts with their distinctive fingerprint attributes. Then, in the inference phase, relevant concepts are retrieved from the personal database, and the best-matching personal concept $\tilde{c}$ is obtained with focused reasoning based on images and fingerprint attributes.
  • Figure 3: Qualitative visualization of concepts for the proposed PerVA dataset. In order, samples from bottles, towels and clothes. Top: reference images for personalization with their concept indicated above. Bottom: query images at inference time.
  • Figure 4: Qualitative examples. Given a query image and a user prompt (left), R2P retrieves the most similar Top-K concepts, analyzes a set of fingerprint attributes, and generates a precise, personalized caption. The key attributes enabling the model to recognize the correct concept name (bold) are underlined for clarity.
  • Figure 5: Qualitative example of Concept Inference with Retrieval-Reasoning in R2P.
  • ...and 3 more figures