Ego: Embedding-Guided Personalization of Vision-Language Models

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi

TL;DR

This work proposes an efficient personalization method that leverages the model's inherent ability to capture personalized concepts: it uses the model's internal attention mechanisms to extract the visual tokens that predominantly represent the target concept.

Abstract

AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.

Paper Structure (29 sections, 4 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Personalization approaches vs Ego. Existing methods typically require test-time or LVLM fine-tuning, or depend on external vision modules, and often fail to support multi-concept or video-level personalization. In contrast, Ego is training-free, LVLM-agnostic, requires no external modules, and efficiently enables single-concept, multi-concept, and video-native personalization within a unified framework.
  • Figure 2: Our proposed method Ego. Personalized Concept Introduction: The LVLM is tasked to estimate the subject area in the reference image and generate keywords describing its main characteristics. Ego identifies the most representative visual tokens via keyword cross-attention and creates a concept memory. Inference: Given a test image, the LVLM in Ego accesses internal concept memories in context to recall and reason about known subjects in the image. Ego requires neither additional training nor external modules.
  • Figure 3: Qualitative results. Top row: keywords and highlighted patches of selected visual tokens (Ego concept memory) for various concepts, illustrating their representativeness and adaptability to object size. Bottom row: Ego demonstrates Video QA capability.
  • Figure 4: Extracted visual tokens for Zak's Dog Coffee. Left: fixed $K=50$. Right: dynamic $K_c = 25$. Ego removes 25 background patches by adapting to the concept's size.
  • Figure 5: Ego's generated keywords, estimated concept sizes, and selected patches are shown using examples from various datasets. Ego demonstrates the ability to accurately estimate concept sizes and extract informative patches for each object while minimizing background interference.
  • ...and 1 more figure
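
The captions for Figures 2 and 4 describe selecting the visual tokens that receive the most attention from the generated keywords, with a token budget that adapts to the estimated concept size. The following is a minimal sketch of that idea, not the authors' implementation. The function name `select_concept_tokens`, the layout of the `attention` matrix, the averaging over keyword tokens, and the rule that scales the budget by `area_fraction` are all assumptions made for illustration; the source does not specify how scores are aggregated or how $K_c$ is computed.

```python
from typing import List


def select_concept_tokens(
    attention: List[List[float]],
    area_fraction: float,
    max_tokens: int = 50,
) -> List[int]:
    """Return indices of the visual tokens most attended to by the keywords.

    Assumptions of this sketch (not taken from the paper):
    - attention[i][j] is the attention weight from keyword token i
      to visual token j, already extracted from the model.
    - area_fraction is the estimated share of the image covered by
      the concept, in the range 0..1.
    """
    if not attention or not attention[0]:
        return []
    num_visual = len(attention[0])

    # Average attention each visual token receives across keyword tokens.
    scores = [
        sum(row[j] for row in attention) / len(attention)
        for j in range(num_visual)
    ]

    # Hypothetical dynamic budget: scale the token count with the
    # estimated concept size, keeping between 1 and max_tokens tokens.
    k = max(1, min(max_tokens, round(area_fraction * num_visual)))

    # Keep the k highest-scoring tokens, returned in spatial order.
    ranked = sorted(range(num_visual), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy example: 2 keyword tokens attending over 4 visual tokens.
toy_attention = [
    [0.10, 0.50, 0.05, 0.35],
    [0.20, 0.40, 0.10, 0.30],
]
print(select_concept_tokens(toy_attention, area_fraction=0.5))  # [1, 3]
```

In this toy setup, a smaller `area_fraction` shrinks the budget, which mirrors the effect shown in Figure 4, where a dynamic budget keeps fewer background patches than a fixed $K$.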