Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, Arash Eshghi

TL;DR

The paper investigates whether multimodal resamplers that compress visual features into a visual prompt preserve fine-grained spatial information essential for spatial understanding tasks. It introduces diagnostic probing with explicit and implicit tasks to evaluate spatial grounding, comparing frozen versus jointly trained resamplers and probes. The key finding is that spatial information is largely absent when resamplers are frozen, but joint training with probes reveals that the information can be encoded, suggesting that current pretraining objectives lack explicit object-centric grounding. This work highlights the need for object-aware pretraining objectives to cultivate spatially disentangled representations and guides future design of V&L systems toward better fine-grained spatial understanding with resamplers.

Abstract

An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a 'visual prompt', which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
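
The frozen-versus-joint comparison described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' setup: the two-feature "image", the scalar linear "resampler", the logistic probe, and all names (`make_data`, `train_probe`, `train_resampler`) are hypothetical stand-ins chosen for illustration. The resampler starts out keeping only the appearance feature, so a probe trained on its frozen output has no access to position; when the resampler is updated together with the probe, it can learn to pass position through.

```python
import math
import random


def make_data(n=300, seed=0):
    """Toy 'images' (assumption): features are (appearance, x_position);
    the label is 1.0 if the object lies on the right half, else 0.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        appearance = rng.uniform(-1.0, 1.0)
        position = rng.uniform(-1.0, 1.0)
        data.append(((appearance, position), 1.0 if position > 0 else 0.0))
    return data


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_probe(data, train_resampler, steps=500, lr=0.5):
    """Train a logistic probe on a scalar 'visual prompt' z = w . x.

    If train_resampler is False, the resampler weights w stay frozen and
    only the probe (a, b) is updated; otherwise both are updated jointly.
    Returns accuracy on the training data.
    """
    w = [1.0, 0.0]   # hypothetical resampler: initially keeps appearance only
    a, b = 1.0, 0.0  # probe: logit = a * z + b
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_a = 0.0
        grad_b = 0.0
        for x, y in data:
            z = w[0] * x[0] + w[1] * x[1]
            err = sigmoid(a * z + b) - y  # d(BCE)/d(logit)
            grad_a += err * z
            grad_b += err
            grad_w[0] += err * a * x[0]
            grad_w[1] += err * a * x[1]
        a -= lr * grad_a / n
        b -= lr * grad_b / n
        if train_resampler:
            w[0] -= lr * grad_w[0] / n
            w[1] -= lr * grad_w[1] / n
    correct = 0
    for x, y in data:
        z = w[0] * x[0] + w[1] * x[1]
        predicted_right = sigmoid(a * z + b) > 0.5
        correct += int(predicted_right == (y == 1.0))
    return correct / n


data = make_data()
frozen_acc = train_probe(data, train_resampler=False)
joint_acc = train_probe(data, train_resampler=True)
print(f"frozen resampler: {frozen_acc:.2f}, joint training: {joint_acc:.2f}")
```

In this toy, the frozen probe stays near chance because the compressed representation never contained position, while joint training recovers it. That mirrors the shape of the argument in the abstract, not its actual numbers or models.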

Paper Structure (19 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Explicit (left) and implicit (right) probing for spatial understanding. In the explicit setting, we probe for region localization, while in the implicit setting, the probe is trained to classify whether a description involving an image region is true of the image.
  • Figure 2: Performance on (a) VSR per intermediate layer, (b) RefCOCOg per MSCOCO super-category.
  • Figure 3: Illustration of positive (a) and negative (b) examples from the RCM task.
  • Figure 4: Performance of Q-Former on RefCOCOg per intermediate layer.
  • Figure 5: Performance of InstructBLIP Q-Former on RefCOCOg per MSCOCO super-category.