Personalized Large Vision-Language Models

Chau Pham; Hoang Phan; David Doermann; Yunjie Tian

TL;DR

PLVM introduces Aligner, an online encoding module that maps a personalized concept from a single reference image into features consumed by a large vision-language model (LVLM), so new concepts can be added in real time at no additional cost and without test-time fine-tuning. A pre-trained vision encoder (DINO-v2) derives $z_{ref}$ from the reference image, and a lightweight transformer produces $e^{word}$ and $e^{weight}$ alongside $k$ context embeddings; PLVM feeds these personalized cues into an LVLM (LLaVA) for referential dialogue and QA. Empirical results show that PLVM outperforms baselines such as YoLLaVA, LLaVA, and GPT-4V under token-constrained prompts, and that it extends naturally to multi-image personalization with competitive gains. A synthetic data pipeline built on IP-Adapter and CelebA-HQ supplies the training data, supporting scalable, continuous personalization without costly fine-tuning. The work demonstrates practical, scalable personalization for LVLMs, while noting limitations when features are shared across identities and pointing to further improvements in robustness and broader applicability.
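To make the shapes in this summary concrete, the following is a minimal sketch of the Aligner's interface only, not the authors' implementation. It assumes $z_{ref}$ is a flat feature vector and replaces the lightweight transformer with random linear projections; `ToyAligner`, `feat_dim`, `embed_dim`, and the dictionary keys are hypothetical names introduced for illustration. How $e^{weight}$ is used downstream is not specified here, so the sketch simply returns it.

```python
import random


def make_matrix(rows, cols, rng):
    """Random stand-in for a learned projection (hypothetical weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyAligner:
    """Maps one reference feature z_ref to e_word, e_weight and k context embeddings.

    Assumption: linear projections stand in for the lightweight transformer.
    """

    def __init__(self, feat_dim, embed_dim, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.word_proj = make_matrix(embed_dim, feat_dim, rng)
        self.weight_proj = make_matrix(embed_dim, feat_dim, rng)
        self.context_projs = [make_matrix(embed_dim, feat_dim, rng) for _ in range(k)]

    def encode(self, z_ref):
        # A single forward pass per concept; no gradient step is involved.
        return {
            "e_word": matvec(self.word_proj, z_ref),
            "e_weight": matvec(self.weight_proj, z_ref),
            "context": [matvec(proj, z_ref) for proj in self.context_projs],
        }


# z_ref would come from a frozen vision encoder; here it is a dummy vector.
aligner = ToyAligner(feat_dim=8, embed_dim=4, k=3)
concept = aligner.encode([1.0] * 8)
```

Under these assumptions, adding a concept costs one call to `encode`, which is what makes the expansion possible without test-time fine-tuning.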

Abstract

Model personalization has gained significant attention in image generation yet remains underexplored for large vision-language models (LVLMs). With personalization, an LVLM can go beyond generic descriptions and conduct interactive dialogues using referential concepts (e.g., "Mike and Susan are talking.") instead of the generic form (e.g., "a boy and a girl are talking."), making the conversation more customizable and referentially friendly. In addition, the proposed PLVM can continuously add new concepts during a dialogue without incurring additional costs, which significantly enhances its practicality. PLVM introduces Aligner, a pre-trained visual encoder that aligns referential concepts with the queried images. During a dialogue, it extracts features from the reference images associated with these concepts and recognizes the concepts in the queried image, enabling personalization. We note that the computational cost and parameter count of the Aligner are negligible within the entire framework. Comprehensive qualitative and quantitative analyses demonstrate the effectiveness and superiority of PLVM.
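The idea of adding concepts during a dialogue can be illustrated with a small registry sketch. This is not the paper's method: the learned Aligner is replaced here by plain cosine matching between stored reference features and features of regions in the queried image, and `ConceptRegistry`, `threshold`, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptRegistry:
    """Stores one reference feature per named concept (illustrative only)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical matching threshold
        self.concepts = {}

    def add(self, name, ref_feature):
        # Adding a concept is a dictionary write: no retraining step.
        self.concepts[name] = list(ref_feature)

    def recognize(self, query_features):
        """Return names whose reference feature matches some query region."""
        found = []
        for name, ref in self.concepts.items():
            best = max((cosine(ref, q) for q in query_features), default=0.0)
            if best >= self.threshold:
                found.append(name)
        return found


registry = ConceptRegistry()
registry.add("Mike", [1.0, 0.0, 0.0])
registry.add("Susan", [0.0, 1.0, 0.0])
first = registry.recognize([[0.9, 0.1, 0.0]])

# A new concept introduced mid-dialogue is usable immediately.
registry.add("Anna", [0.0, 0.0, 1.0])
second = registry.recognize([[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]])
```

With these toy vectors, `first` contains only "Mike", and `second` contains "Mike" and "Anna"; the recognized names could then be used in place of generic phrases such as "a boy" in the model's reply.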
