Personalized Large Vision-Language Models

Chau Pham, Hoang Phan, David Doermann, Yunjie Tian

TL;DR

PLVM introduces Aligner, an online encoding module that maps a personalized concept from a single reference image into features a large vision-language model can consume, so new concepts can be added in real time, at no additional cost and without test-time fine-tuning. A pre-trained vision encoder (DINO-v2) extracts the reference features $z_{ref}$, and a lightweight transformer turns them into $e^{word}$, $e^{weight}$, and $k$ context embeddings, which PLVM feeds into an LVLM (LLaVA) for referential dialogue and QA. Empirical results show PLVM outperforms baselines such as YoLLaVA, LLaVA, and GPT-4V under token-constrained prompts, and it extends naturally to multi-image personalization with competitive gains. Training relies on a synthetic data pipeline built with IP-Adapter and CelebA-HQ, which enables scalable, continuous personalization without costly fine-tuning. The work demonstrates practical, scalable personalization for LVLMs, while noting limitations when features are shared across identities and leaving robustness and broader applicability as directions for improvement.

Abstract

Personalization has gained significant attention in image generation yet remains underexplored for large vision-language models (LVLMs). With personalization, LVLMs can go beyond generic descriptions and handle interactive dialogues using referential concepts (e.g., "Mike and Susan are talking.") instead of the generic form (e.g., "a boy and a girl are talking."), making the conversation more customizable and referentially friendly. In addition, PLVM can continuously add new concepts during a dialogue without incurring additional costs, which significantly enhances its practicality. PLVM proposes Aligner, a pre-trained visual encoder that aligns referential concepts with the queried images. During a dialogue, it extracts features from the reference images associated with these concepts and recognizes them in the queried image, enabling personalization. We note that the computational cost and parameter count of the Aligner are negligible within the entire framework. With comprehensive qualitative and quantitative analyses, we demonstrate the effectiveness and superiority of PLVM.
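
To make the data flow in the TL;DR concrete, the sketch below shows one way reference features $z_{ref}$ could be turned into $e^{word}$, $e^{weight}$, and $k$ context embeddings. It is only an illustration under stated assumptions, not the paper's Aligner: the names `aligner_sketch` and `cross_attend` are hypothetical, the single cross-attention step stands in for the lightweight transformer, the query vectors are random rather than learned, and the feature dimensions are made up.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, feats):
    """Single-head scaled dot-product cross-attention.

    Each query produces a weighted average of the reference features.
    """
    d = len(feats[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in feats]
        weights = softmax(scores)
        pooled.append([sum(w * f[i] for w, f in zip(weights, feats)) for i in range(d)])
    return pooled


def aligner_sketch(z_ref, k, seed=0):
    """Map reference features to e_word, e_weight, and k context embeddings.

    ASSUMPTION: k + 2 query vectors (random here, learned in a real model)
    each pool z_ref once; the real module's layers and sizes are not given
    in the text this sketch accompanies.
    """
    rng = random.Random(seed)
    d = len(z_ref[0])
    queries = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k + 2)]
    pooled = cross_attend(queries, z_ref)
    return {"e_word": pooled[0], "e_weight": pooled[1], "context": pooled[2:]}


# Toy stand-in for encoder output: 4 patch features of dimension 8 (made-up sizes).
_rng = random.Random(42)
z_ref = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
out = aligner_sketch(z_ref, k=3)
print(len(out["e_word"]), len(out["e_weight"]), len(out["context"]))
```

Because the concept embeddings come from a single forward pass over the reference image, adding a new concept in this sketch needs no gradient updates, which mirrors the no-fine-tuning property described above.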

Paper Structure

This paper contains 16 sections, 2 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: With personalized concepts, PLVM enhances user interaction with large vision-language models, making it easier and more intuitive.
  • Figure 2: The overall framework of PLVM, where the large language model receives the spatial features, the concept template of a personalized reference image, text prompt, and produces the answer.
  • Figure 3: Examples of our synthetic data, generated with diffusion models (Rombach et al., 2022), which can produce customized images based on a reference appearance and a given prompt.
  • Figure 4: Examples of the three types of evaluation questions: recognition, text-only QA, and visual QA questions.
  • Figure 5: Qualitative results compared with YoLLaVA, LLaVA (with prompt), and GPT-4V (with prompt). PLVM requires no fine-tuning for every concept, enabling seamless incorporation of new concepts, as shown in the figure. In contrast, YoLLaVA needs $\sim$40 minutes of fine-tuning for each new concept to achieve personalization. Zoom in for details.
  • ...and 4 more figures