Table of Contents
Fetching ...

MMPB: It's Time for Multi-Modal Personalization

Jaeik Kim, Woojin Kim, Woohyeon Park, Jaeyoung Do

TL;DR

MMPB introduces the first comprehensive benchmark for evaluating personalization in multi-modal vision-language models. It formalizes four core personalization criteria and a principled injection mechanism for user concepts, then evaluates 23 diverse VLMs on recognition and preference-grounded VQA across 111 concepts and 10,017 image–query pairs. The study reveals persistent challenges: models struggle with preference-grounded abductive reasoning, exhibit safety-driven evasiveness, and show degraded persistency over long, multi-turn dialogues, especially with image-based concept injections. By detailing systematic failure modes and offering a scalable benchmark with multi-turn evaluation, MMPB provides a foundation for advancing truly personalized multi-modal AI with implications for smart homes, healthcare, and human-centric interaction.

Abstract

Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB

MMPB: It's Time for Multi-Modal Personalization

TL;DR

MMPB introduces the first comprehensive benchmark for evaluating personalization in multi-modal vision-language models. It formalizes four core personalization criteria and a principled injection mechanism for user concepts, then evaluates 23 diverse VLMs on recognition and preference-grounded VQA across 111 concepts and 10,017 image–query pairs. The study reveals persistent challenges: models struggle with preference-grounded abductive reasoning, exhibit safety-driven evasiveness, and show degraded persistency over long, multi-turn dialogues, especially with image-based concept injections. By detailing systematic failure modes and offering a scalable benchmark with multi-turn evaluation, MMPB provides a foundation for advancing truly personalized multi-modal AI with implications for smart homes, healthcare, and human-centric interaction.

Abstract

Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB

Paper Structure

This paper contains 66 sections, 1 equation, 26 figures, 11 tables.

Figures (26)

  • Figure 1: Examples of personalized queries across task types and representative failure cases of recent VLMs. indicates GPT-4o, while indicates LLaVA family models such as LLaVA-NeXT.
  • Figure 2: Overview of MMPB. (top) A three‐step construction process ensuring high quality and scalability. (bottom) An evaluation protocol for assessing the VLM’s personalization criteria.
  • Figure 3: Example of quality control for a Coherency-type query with concept-only and image-only distractors.
  • Figure 4: Evaluation results of 23 VLMs on MMPB under 0-turn and 10-turn settings. Model names are followed by their average ranks across eight general-purpose multi-modal benchmarks.
  • Figure 5: Performance gap between preference-grounded and recognition VQA tasks in VLMs.
  • ...and 21 more figures