Personalized Visual Instruction Tuning

Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang

TL;DR

PVIT introduces an in-context personalization paradigm for multimodal large language models (MLLMs), enabling them to conduct personalized conversations about arbitrary individuals without additional fine-tuning for each new individual. The authors use a three-phase automatic data-generation pipeline (visual concept curation, dual-level textual fusion, and PVIT dataset generation) to create the PVIT-3M dataset and a companion benchmark, P-Bench. Results show that PVIT markedly improves personalized perception and dialogue, outperforming state-of-the-art MLLMs on both recognition and description tasks, and that robustness grows with data scale and name diversity. The work targets realistic personalized visual assistants and domestic robots by embedding individualized reasoning into MLLMs via prefixes and wrapper tokens.

Abstract

Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as "face blindness". Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multi-modal) large language models. To evaluate the personalized potential of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial personalized performance enhancement after fine-tuning with our curated dataset.

Paper Structure

This paper contains 40 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: The Personalized Visual Instruction Tuning (PVIT) framework consists of three phases. In the visual concept curation phase, we extract individuals and their faces from images, then augment them with different poses and angles. During the dual-level textual information extraction and fusion phase, MLLMs first generate both holistic information and personal information, then integrate them to get more detailed and contextually accurate information. In the PVIT dataset generation phase, LLMs create QA pair templates based on the extracted textual information, which are filled with selected names to construct training data.
  • Figure 2: Qualitative examples of P-LLaVA results. Each example includes the user's query, input individual photos, and the scene image. Current MLLMs fail to recognize the person of interest and conduct personalized conversations, whereas our model, after training with PVIT, enables coherent and accurate personalized dialogues. Examples illustrate both answerable and unanswerable scenarios. For answerable cases, inputs involve single or multiple individuals, and our model incorporates names from the prefix for personalized responses. In unanswerable cases, current MLLMs provide incorrect answers, while our model appropriately refuses and explains the reason.
  • Figure 3: Statistics of PVIT-3M, a large-scale personalized instruction-tuning dataset. Left: Data distribution within each category. The outer circle shows the distribution of all data categories and the inner circle shows the distribution of data subsets. Right: The detailed quantities of datasets.
  • Figure :