MyVLM: Personalizing VLMs for User-Specific Queries

Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or

TL;DR

This work takes a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts, and demonstrates the ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.

Abstract

Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability for personalized visual question-answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
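The recognize-then-inject flow described above can be illustrated with a small sketch. It is not the authors' implementation: the `ConceptHead` class, its logistic-regression form, the 0.5 threshold, and the `personalize_features` helper are all hypothetical names and choices made for illustration. The sketch only shows the idea of a per-concept head acting as a toggle, with the concept embedding appended to the vision features when the head fires and the features left unchanged otherwise.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class ConceptHead:
    """Hypothetical binary classifier that toggles one user-specific concept."""
    name: str
    weights: Vector
    bias: float
    embedding: Vector        # learned concept embedding (same width as vision tokens)
    threshold: float = 0.5   # assumed decision threshold

    def score(self, image_feature: Vector) -> float:
        # Logistic score over a pooled image feature (an assumed head design).
        logit = sum(w * x for w, x in zip(self.weights, image_feature)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))

    def is_present(self, image_feature: Vector) -> bool:
        return self.score(image_feature) >= self.threshold


def personalize_features(
    vision_tokens: List[Vector],
    pooled_feature: Vector,
    heads: List[ConceptHead],
) -> Tuple[List[Vector], List[str]]:
    """Append the embedding of every detected concept to the vision tokens."""
    tokens = [list(t) for t in vision_tokens]  # copy; the input stays untouched
    detected = []
    for head in heads:
        if head.is_present(pooled_feature):
            tokens.append(list(head.embedding))
            detected.append(head.name)
    return tokens, detected


# Toy usage with made-up 2-dimensional features.
head = ConceptHead(name="<your-dog>", weights=[2.0, -1.0], bias=0.0, embedding=[0.1, 0.9])
tokens = [[0.5, 0.5], [0.2, 0.8]]

with_concept, found = personalize_features(tokens, [1.0, 0.0], [head])
without_concept, none_found = personalize_features(tokens, [0.0, 1.0], [head])
```

In this toy setup, an image whose head does not fire passes through with its vision tokens unchanged, which is one simple way to picture how behavior on unrelated inputs could be preserved.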

Paper Structure

This paper contains 54 sections, 4 equations, 25 figures, and 7 tables.

Figures (25)

  • Figure 1: Given a set of images depicting user-specific concepts such as $\langle$you$\rangle$, $\langle$your-dog$\rangle$ and $\langle$your-friend$\rangle$ (left), we teach a pretrained vision-language model (VLM) to understand and reason over these concepts. First, we enable the model to generate personalized captions incorporating the concept into its output text (middle). We further allow the user to ask subject-specific questions about these concepts, querying the model with questions such as "What are $\langle$you$\rangle$ doing?" or "What is my $\langle$your-friend$\rangle$ wearing?" (right).
  • Figure 2: MyVLM overview, applied over BLIP-2. Given an input image, we pass it through the frozen vision encoder of the VLM. In parallel, we pass the image through a set of learned concept heads, each tasked with recognizing a single user-specific concept. We append the concept embedding of the identified concept to the extracted vision features. These features are then passed to the Q-Former via a set of cross-attention layers to extract relevant information from the image features and concept embedding. Given the Q-Former outputs and language instruction, the frozen LLM outputs a response incorporating the concept identifier while remaining aligned with the input.
  • Figure 3: Self-attention visualization. We examine the self-attention of LLaVA's language model to visualize the attention weights assigned from the concept embedding to each image feature. As can be seen, the concept embedding attends to relevant regions within the images, assigning higher weights to areas where the concept is located.
  • Figure 4: Personalized captioning results obtained by MyVLM, applied over LLaVA [liu2023llava]. Sample images of the target concept are provided in the top row. Text in green highlights the description of the target concept in the image.
  • Figure 5: Comparison to the LLM-guided captioning baseline. Results are obtained over LLaVA [liu2023llava]. Sample images of the target concept are shown in the top row. Additional comparisons to all baselines over BLIP-2 [li2023blip] and LLaVA are provided in the additional results section.
  • ...and 20 more figures
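
The attention visualization described in the Figure 3 caption can be pictured with a toy computation. The sketch below is an assumption-laden simplification, not LLaVA's actual attention: it uses a single head, no learned query/key projections, and made-up 2-dimensional vectors. The helper names `attention_weights` and `to_grid` are hypothetical. It computes scaled dot-product attention weights from one concept embedding to each image patch feature and reshapes them into a grid that could be rendered as a heatmap.

```python
import math
from typing import List

Vector = List[float]


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Softmax of scaled dot products between one query and each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_grid(weights: List[float], width: int) -> List[List[float]]:
    """Reshape flat per-patch weights into rows of a 2D heatmap."""
    if len(weights) % width != 0:
        raise ValueError("number of weights must be a multiple of width")
    return [weights[i:i + width] for i in range(0, len(weights), width)]


# Toy example: one concept embedding attending over a 2x2 grid of patches.
concept = [1.0, 0.0]
patches = [[0.1, 0.3], [2.0, 0.1], [0.0, 1.0], [0.2, 0.2]]

weights = attention_weights(concept, patches)
heatmap = to_grid(weights, width=2)
```

Here the second patch is most aligned with the concept vector, so it receives the largest weight, mirroring the caption's observation that higher weights fall on regions where the concept is located.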