Do Pre-trained Vision-Language Models Encode Object States?

Kaleb Newman, Shijie Wang, Yuan Zang, David Heffren, Chen Sun

TL;DR

This paper investigates whether pre-trained Vision-Language Models encode object states such as whole versus sliced forms. Using the ChangeIt-Frames benchmark, it shows that while object recognition is strong, state recognition lags significantly, even with larger models or multimodal LLMs; analysis reveals that text and visual representations are not discriminative for object states, and object-centric localization alone does not solve the issue. The authors explore remedies via object-centric representations, larger models, and multimodal LLMs, finding partial gains in concept binding but limited improvements in state discrimination. The work highlights the need for better localization, binding of concepts to objects, and state-focused pretraining objectives to enable physical commonsense reasoning in VLMs.

Abstract

For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g., a whole apple becoming a sliced apple). Our paper investigates whether VLMs pre-trained on web-scale data learn to encode object states, which can be extracted with zero-shot text prompts. We curate an object state recognition dataset, ChangeIt-Frames, and evaluate nine open-source VLMs, including models trained with contrastive and generative objectives. We observe that while these state-of-the-art vision-language models can reliably perform object recognition, they consistently fail to accurately distinguish the objects' physical states. Through extensive experiments, we identify three areas for improvement for VLMs to better encode object states, namely the quality of object localization, the architecture to bind concepts to objects, and the objective to learn discriminative visual and language encoders on object states. Data and code are released.

Paper Structure (13 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: ChangeIt-Frames dataset. The images are sourced from instructional videos [soucek2022lookforthechange]. We use Amazon MTurk to manually verify a subset of the image annotations and to draw bounding boxes for the objects of interest. For evaluation, we ask a VLM to choose the correct object state among ten candidate prompts selected via standard or distractor strategies.
  • Figure 2: t-SNE visualization of CLIP text embeddings. The representations of the text prompts are clustered by object category, and the representations for different states of the same object are very similar.
  • Figure 3: t-SNE projections of CLIP visual embeddings for "bacon", including original and cropped images for both states.
  • Figure A1: Visualization of initial and terminal states from categories in ChangeIt-Frames.
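
The evaluation described in the Figure 1 caption, where a VLM picks the correct object state among ten candidate prompts, can be illustrated for a contrastive model with a minimal sketch. This is not the authors' released code. It assumes the image and the candidate prompts have already been embedded by some dual-encoder VLM, and it uses toy vectors as stand-ins for those embeddings. The function names `choose_prompt` and `zero_shot_accuracy` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def choose_prompt(image_emb, prompt_embs):
    """Return (index, similarities) for the best-matching candidate prompt.

    image_emb:   shape (d,)   embedding of one image (assumed precomputed)
    prompt_embs: shape (k, d) embeddings of k candidate prompts
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=float))
    prompt_embs = l2_normalize(np.asarray(prompt_embs, dtype=float))
    sims = prompt_embs @ image_emb  # cosine similarity per candidate
    return int(np.argmax(sims)), sims


def zero_shot_accuracy(image_embs, prompt_embs_per_image, labels):
    """Fraction of images whose top-scoring prompt is the labeled one."""
    correct = 0
    for img, prompts, label in zip(image_embs, prompt_embs_per_image, labels):
        pred, _ = choose_prompt(img, prompts)
        correct += int(pred == label)
    return correct / len(labels)


if __name__ == "__main__":
    # Toy stand-ins: ten orthogonal "prompt" embeddings in 16 dimensions.
    prompts = np.eye(10, 16)
    # An "image" embedding that mostly points along candidate 3.
    image = 2.0 * prompts[3] + 0.1 * prompts[7]
    pred, sims = choose_prompt(image, prompts)
    print("predicted candidate:", pred)
    print("accuracy:", zero_shot_accuracy([image], [prompts], [3]))
```

With a real model, the toy vectors would be replaced by the outputs of its image and text encoders. Generative VLMs would need a different scoring rule, which this sketch does not cover.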