MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang

TL;DR

MC-LLaVA tackles the lack of multi-concept personalization in vision-language models by introducing a joint multi-concept instruction-tuning framework with personalized textual prompts initialized from visual tokens and a training-free, location-aware personalized visual prompt for grounding. It also provides a high-quality multi-concept instruction dataset created from concept-rich movies and GPT-4o-assisted QA data to support evaluation across recognition, grounding, QA, and captioning. Empirical results demonstrate state-of-the-art performance on recognition and grounding, competitive QA performance relative to GPT-4o, and strong captioning recall, while reducing training costs via token initialization and joint training. The work advances practical, user-specific VLM assistants and offers a public dataset and codebase to spur further research in multi-concept personalization.

Abstract

Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes the first multi-concept personalization paradigm, MC-LLaVA. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the costs related to joint training, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location confidence maps for enhanced recognition and grounding capabilities. To advance multi-concept personalization research, we further contribute a high-quality instruction tuning dataset. We carefully collect images with multiple characters and objects from movies and manually generate question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive qualitative and quantitative experiments demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at https://github.com/arctanxarc/MC-LLaVA.
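
The abstract describes the personalized visual prompt only as "aggregating location confidence maps". As a rough illustration of that idea, the sketch below scores every image patch against each learned concept token and averages the per-token maps into one confidence map. The use of cosine similarity, the mean as the aggregation rule, and the function names `location_confidence_map` and `peak_location` are all assumptions of this sketch, not details confirmed by the text above.

```python
import numpy as np

def location_confidence_map(patch_features, concept_tokens, grid_hw):
    """Aggregate per-token similarity maps into one (h, w) confidence map.

    patch_features: (h*w, d) visual tokens of the test image (assumed layout).
    concept_tokens: (t, d) learned tokens of one concept.
    Cosine similarity and mean aggregation are assumptions of this sketch.
    """
    h, w = grid_hw
    p = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    c = concept_tokens / np.linalg.norm(concept_tokens, axis=1, keepdims=True)
    per_token_maps = c @ p.T                  # (t, h*w), one map per concept token
    return per_token_maps.mean(axis=0).reshape(h, w)

def peak_location(conf_map):
    """Return the (row, col) patch with the highest aggregated confidence."""
    row, col = np.unravel_index(np.argmax(conf_map), conf_map.shape)
    return int(row), int(col)

# Toy example: a 2x3 patch grid with orthogonal features; the concept tokens
# point in the same direction as the patch at flat index 4.
patches = np.eye(6)
tokens = np.stack([2.0 * patches[4], 0.5 * patches[4]])
conf = location_confidence_map(patches, tokens, (2, 3))
print(peak_location(conf))
```

In a pipeline of this kind, the peak (or a thresholded region) of each concept's map could then be marked on the image as a visual prompt before the image is passed to the VLM; how MC-LLaVA actually renders the prompt is not specified in this section.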

Paper Structure

This paper contains 35 sections, 6 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Case studies utilizing various concepts from the Yo'LLaVA dataset. The left panel shows the limitations of separately trained Yo'LLaVA models, while the right panel emphasizes the significance of high-quality negative samples for Yo'LLaVA.
  • Figure 2: The vanilla LLaVA fails to understand user-provided concepts. Existing methods like Yo'LLaVA mainly focus on single-concept personalization and cannot generate accurate, personalized responses based on multi-concepts. The proposed MC-LLaVA learns multiple concepts and can perform accurately in multi-concept personalization across various tasks such as recognition, VQA, and captioning.
  • Figure 3: The illustration of MC-LLaVA. (a) We use a multi-concept joint training strategy to learn the personalized textual prompts and classifier weights. (b) Given $m$ concepts, we utilize visual tokens obtained from K-means centroids to initialize the $m \times (k+1)$ concept tokens in personalized textual prompts, reducing the costs associated with joint training. (c) During inference, we introduce a personalized visual prompt for VLMs by aggregating location confidence maps based on learned concept tokens.
  • Figure 4: Examples of the proposed multiple concept personalization dataset. The dataset includes not only adults but also children, animals and objects, derived from cartoons and movies. To facilitate visualization, concept identifiers have been abbreviated using letters.
  • Figure 5: Training progress on high-quality negative samples and K-means initialization.
  • ...and 8 more figures
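
The initialization described in the Figure 3 caption, where K-means centroids of visual tokens seed the $m \times (k+1)$ concept tokens, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the plain Lloyd-style `kmeans` helper, the function name `init_concept_tokens`, and in particular the choice to initialize the extra (k+1)-th token of each concept as the mean of its centroids are assumptions made for the example. It also assumes the visual tokens already live in the same embedding space as the text tokens.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def init_concept_tokens(visual_tokens_per_concept, k):
    """Build an (m * (k + 1), d) initialization for m concepts.

    visual_tokens_per_concept: list of m arrays, each (n_i, d), holding the
    visual tokens collected from one concept's training images.
    """
    blocks = []
    for tokens in visual_tokens_per_concept:
        centroids = kmeans(tokens, k)                       # (k, d)
        # Assumption of this sketch: the extra token is the centroid mean.
        extra = centroids.mean(axis=0, keepdims=True)       # (1, d)
        blocks.append(np.concatenate([extra, centroids], axis=0))
    return np.concatenate(blocks, axis=0)

# Toy example: m = 2 concepts, k = 3 centroids each, 8-dimensional tokens.
rng = np.random.default_rng(1)
concept_tokens = [rng.normal(size=(50, 8)), rng.normal(size=(40, 8))]
init = init_concept_tokens(concept_tokens, k=3)
print(init.shape)
```

Starting from centroids rather than random vectors gives each concept token a representation that already resembles the concept's appearance, which is consistent with the caption's stated aim of reducing the cost of joint training.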