Table of Contents
Fetching ...

Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza

TL;DR

This work tackles few-shot personalized object localization in Vision-Language Models (VLMs) by introducing IPLoc, a data-centric approach that leverages in-context examples to localize the same object type in a query image without retraining. It constructs context-rich instruction-tuning conversations from video object-tracking data and employs pseudo-name regularization to force reliance on visual context over pre-existing knowledge. Through LoRA-based fine-tuning across diverse model backbones (up to 72B parameters) and three large tracking datasets, IPLoc achieves notable improvements in personalized localization performance and demonstrates robust generalization, while exposing weaknesses in current models like GPT-4o for this task. The approach provides a foundation for context-driven vision-language applications and highlights how data design can unlock contextual learning in multimodal models.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs -- exposing critical weaknesses in present-day VLMs, and laying a foundation for future research in context-driven vision-language applications.

Teaching VLMs to Localize Specific Objects from In-context Examples

TL;DR

This work tackles few-shot personalized object localization in Vision-Language Models (VLMs) by introducing IPLoc, a data-centric approach that leverages in-context examples to localize the same object type in a query image without retraining. It constructs context-rich instruction-tuning conversations from video object-tracking data and employs pseudo-name regularization to force reliance on visual context over pre-existing knowledge. Through LoRA-based fine-tuning across diverse model backbones (up to 72B parameters) and three large tracking datasets, IPLoc achieves notable improvements in personalized localization performance and demonstrates robust generalization, while exposing weaknesses in current models like GPT-4o for this task. The approach provides a foundation for context-driven vision-language applications and highlights how data design can unlock contextual learning in multimodal models.

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances the few-shot localization performance of recent VLMs ranging from 7B to 72B in size, without sacrificing generalization, as demonstrated on several benchmarks tailored towards evaluating personalized localization abilities. This work is the first to explore and benchmark personalized few-shot localization for VLMs -- exposing critical weaknesses in present-day VLMs, and laying a foundation for future research in context-driven vision-language applications.

Paper Structure

This paper contains 28 sections, 3 equations, 14 figures, 23 tables.

Figures (14)

  • Figure 1: In-context personalized localization involves localizing object instances present in a scene (or query image) similar to the object presented as an in-context example. In this setting, the input to the model is a category name, in-context image, bounding box coordinates (not shown in this figure), and a query image. The model is tasked with localizing the same category of interest (presented as an in-context example) in the query image. Here, we visualize a few inputs and outputs from various VLMs highlighting that our fine-tuned model better captures the information in the in-context image.
  • Figure 2: Overview of data creation and conversation format. To instill few-shot personalized localization abilities in VLMs, our IPLoc creates multi-modal conversations by harnessing data from multiple video object tracking datasets. For semantic coherence, focus on personalization and stronger contextual awareness, we create these conversations by sampling frames from the same video, tracking a particular object of interest, and enhancing the training data by extending the conversations by replacing the true category name with pseudo names. These conversations are later employed to induce contextual awareness in VLMs.
  • Figure 3: Effect of increasing number of shots. We report mIOU (%) on the LASOT lasot test split. IPLoc refers to the Qwen2-VL qwen2vl fine-tuned on the proposed data mix in this paper.
  • Figure 4: Qwen2-VL Original Prompt
  • Figure 5: Qwen2-VL / InternVL2 Prompt 1
  • ...and 9 more figures