RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue

TL;DR

This work tackles the challenge of personalizing multimodal LLMs by introducing the Retrieval-Augmented Personalization (RAP) framework, which stores user concepts in an external memory and uses a multimodal retriever to fetch relevant knowledge for personalized generation. RAP follows a Remember–Retrieve–Generate pipeline, enabling real-time concept editing without retraining and supporting infinite new concepts after pretraining on a specialized personalization dataset. A large-scale RAP dataset is constructed to train RAP-MLLMs (e.g., RAP-LLaVA, RAP-Phi3-V) for tasks such as personalized image captioning, question answering, and visual recognition, achieving superior performance and data efficiency compared to baselines. The approach demonstrates practical benefits for deploying personalized multimodal assistants on resource-constrained devices, with the potential to adapt quickly to new user concepts while maintaining privacy through local memory. Overall, RAP offers a scalable pathway to personalized, knowledge-augmented multimodal interaction without repeated model updates.

Abstract

The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval-Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar, and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering, and visual recognition. The code, data, and models are available at https://hoar012.github.io/RAP-Project/.
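
The three steps in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Concept`, `ConceptMemory`, and `build_prompt`, the three-dimensional toy embeddings, and the prompt layout are all hypothetical, and plain cosine similarity stands in for the multimodal retriever. The final call to an MLLM is omitted; the sketch only assembles the text that would accompany the image.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Concept:
    """One database entry (hypothetical schema): a name, a description, an embedding."""
    name: str
    info: str
    embedding: List[float]  # toy stand-in for an embedding of the concept's image


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConceptMemory:
    """Remember: an external key-value store that can be edited at any time."""

    def __init__(self) -> None:
        self._store: Dict[str, Concept] = {}

    def remember(self, concept: Concept) -> None:
        # Adding or overwriting an entry changes behavior without touching model weights.
        self._store[concept.name] = concept

    def forget(self, name: str) -> None:
        self._store.pop(name, None)

    def retrieve(self, query_embedding: List[float], k: int = 1) -> List[Concept]:
        """Retrieve: return the k stored concepts most similar to the query."""
        ranked = sorted(
            self._store.values(),
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question: str, concepts: List[Concept]) -> str:
    """Generate (input side only): prepend retrieved concept info to the user query."""
    lines = [f"<{c.name}>: {c.info}" for c in concepts]
    return "\n".join(lines + [question])


memory = ConceptMemory()
memory.remember(Concept("bo", "A golden retriever owned by the user.", [1.0, 0.0, 0.1]))
memory.remember(Concept("mug", "The user's blue coffee mug.", [0.0, 1.0, 0.2]))

# Assumed embedding of a region of interest cropped from the query image.
region_embedding = [0.9, 0.1, 0.0]
retrieved = memory.retrieve(region_embedding, k=1)
prompt = build_prompt("What is in this photo?", retrieved)
print(prompt)
```

Because the memory lives outside the model, calling `remember` again with the same name overwrites the stored description, which mirrors the real-time concept editing the abstract describes.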

Paper Structure

This paper contains 24 sections, 1 equation, 9 figures, 28 tables.

Figures (9)

  • Figure 1: Once some user-specific concepts are introduced to our RAP-MLLM, it can remember them and achieve excellent performance in a variety of personalized multimodal generation tasks.
  • Figure 2: Retrieval-Augmented Personalization Framework. Regions of interest detected by an open-world detector are used to retrieve concepts from the database. The images and information of the retrieved concepts are then integrated into the input for the MLLM.
  • Figure 3: Our Pipeline for Data Collection. We first crop the target concept from the image based on the dataset annotations and then query Gemini to generate its personalized description. We also apply data augmentation to diversify these cropped images. Then we combine them with the original image to derive a series of instructions and answers from Gemini. When noise concepts are included in the additional information, the answer remains unchanged, which helps train the MLLMs' ability to filter out irrelevant concepts.
  • Figure 4: Performance under a varying number of personalized concepts.
  • Figure 5: Retriever's Top-K Recall under varying database size N.
  • ...and 4 more figures