PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano

TL;DR

This work tackles the lack of spatial grounding in caption-based Vision-Language Models by introducing PIN, a lightweight, input-agnostic spatial prompt that is inserted after the vision encoder of a frozen VLM. Trained with a simple next-token objective on synthetically generated data, PIN enables zero-shot object localisation without adding new output heads or relying on localisation supervision. Across OpenFlamingo and BLIP-2, PIN achieves strong localisation on PVOC, COCO, LVIS, and zero-shot grounding on RefCOCO, while preserving the VLM's general abilities. The approach demonstrates that a minimal, trainable spatial prompt can unlock cross-domain localisation, offering a scalable, data-efficient path to grounded multimodal understanding.

Abstract

Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.
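
The core mechanism can be illustrated with a minimal sketch. It assumes the Positional Insert is a single tensor with the same shape as one image's vision features, added element-wise to the frozen encoder's output, as the caption of Figure 3 describes. The function name `apply_pin`, the array sizes, and the random stand-in for the encoder output are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def apply_pin(x_v: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """Add an input-agnostic spatial prompt `pi` to vision features `x_v`.

    x_v: (batch, num_patches, dim) features from a frozen vision encoder.
    pi:  (num_patches, dim) learnable insert, shared across all inputs.
    """
    if x_v.shape[-2:] != pi.shape:
        raise ValueError("pi must match the per-image feature shape")
    # Broadcasting adds the same insert to every image in the batch.
    return x_v + pi


# Hypothetical sizes; a real encoder would dictate these.
batch, num_patches, dim = 2, 16, 8
rng = np.random.default_rng(0)

# Stand-in for the output of the frozen vision encoder.
x_v = rng.normal(size=(batch, num_patches, dim))

# In this sketch the insert is the only trainable parameter; the VLM stays frozen.
pi = np.zeros((num_patches, dim))

x_star = apply_pin(x_v, pi)
```

Because `pi` does not depend on the image, the same insert is reused for every input, which is what makes it input-agnostic in this sketch.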

Paper Structure (43 sections, 4 equations, 15 figures, 6 tables)

This paper contains 43 sections, 4 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: We learn a single Positional Insert (PIN) for unlocking zero-shot object localisation abilities in a frozen Vision-Language Model (VLM) without adding any additional heads or requiring supervised datasets. Further output examples are shown in Figs. \ref{fig:ADE} and \ref{fig:coco_pvoc}.
  • Figure 2: Examples from our analysis of the localisation abilities of existing caption-based VLMs. GPT-4V \cite{open_ai-gpt4} is the only model that returns bounding boxes and thereby roughly localises the object. All other VLMs struggle to localise the objects in the image. Further examples and different kinds of prompts are provided in the supplemental Sec. \ref{sec:extended_study}.
  • Figure 3: Schematic overview of our method. We generate synthetic training data by overlaying objects on background images using our composition function $C$. These images are then encoded, and our lightweight learnable spatial prompt vector $\pi$ from the PIN module is added to their vision encodings $x_v$. Using the VLM's standard forward pass, a location text response is generated based on the input object name and the enhanced visual feature $x^{\star}_v$. The PIN module is optimized with cross-entropy by comparing this generated text against the known object locations from the composition function $C$.
  • Figure 4: Sample images from our synthetic data generation.
  • Figure 5: Localisation on a wide range of image types, from paintings and comics to unique scenarios. Despite the varying image content, the OpenFlamingo caption-based VLM enhanced with our PIN shows strong localisation abilities.
  • ...and 10 more figures
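
The synthetic data generation described in the caption of Figure 3 can be sketched roughly as follows. This is a simplified illustration, assuming a composition function that pastes a rectangular object crop onto a background and records the resulting box. The names `compose` and `location_text`, the pixel-coordinate box convention, and the text format of the target are all hypothetical; the source does not specify them.

```python
import numpy as np


def compose(background: np.ndarray, obj: np.ndarray, top: int, left: int):
    """Paste `obj` onto a copy of `background` at (top, left).

    Returns the composed image and the box as (x_min, y_min, x_max, y_max)
    in pixel coordinates. The box is known exactly because we chose the
    paste location ourselves, so no human annotation is needed.
    """
    h, w = obj.shape[:2]
    big_h, big_w = background.shape[:2]
    if top < 0 or left < 0 or top + h > big_h or left + w > big_w:
        raise ValueError("object does not fit inside the background")
    image = background.copy()
    image[top:top + h, left:left + w] = obj
    return image, (left, top, left + w, top + h)


def location_text(box) -> str:
    """Format a box as a text target (hypothetical format)."""
    return "[{}, {}, {}, {}]".format(*box)


# Toy example: a 3x4 "object" of ones pasted onto a 10x10 black background.
background = np.zeros((10, 10, 3), dtype=np.uint8)
obj = np.ones((3, 4, 3), dtype=np.uint8)

image, box = compose(background, obj, top=2, left=5)
target = location_text(box)
```

In a training loop, a string like `target` would serve as the next-token prediction target, with the cross-entropy loss updating only the PIN parameters while the VLM stays frozen.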