Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?

Itay Cohen, Ethan Fetaya, Amir Rosenfeld

TL;DR

This work investigates whether modern vision–language models, notably CLIP, can distinguish real objects from look-alike instances such as toys, statues, drawings, and pareidolia. It introduces the RoLA dataset and develops a data-efficient Real/Look-alike direction $\hat{d}^{(-k)}$ in CLIP's embedding space by averaging category-wise mean differences, enabling both improved alignment of embeddings and controllable output shifts. Through prompt-based classification, embedding-direction augmentation for retrieval, and a CLIP prefix captioner, the approach yields substantial gains in cross-modal discrimination and captioning semantics, and shows strong generalization across held-out categories. The results suggest practical benefits for tasks requiring fine-grained, human-like object discrimination and offer a lightweight, easily deployable mechanism to bias model outputs toward realism or lookalike representations in multimodal systems.
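The direction estimate described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sign convention (lookalike mean minus real mean), the L2 normalization, the reading of the superscript $(-k)$ as "category $k$ held out", and the scaling parameter `alpha` are all assumptions introduced here. Embeddings are assumed to be precomputed CLIP vectors grouped by category.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def real_lookalike_direction(real_by_cat, lookalike_by_cat, held_out=None):
    """Average the per-category mean differences (lookalike - real).

    real_by_cat / lookalike_by_cat: dict mapping category name to an
    array of shape (n_images, dim). `held_out` (hypothetical parameter)
    names a category to exclude, so the direction can be tested on a
    category it was not estimated from.
    """
    diffs = []
    for cat, real_embs in real_by_cat.items():
        if cat == held_out:
            continue
        # Assumed sign convention: the direction points from real to lookalike.
        diffs.append(lookalike_by_cat[cat].mean(axis=0) - real_embs.mean(axis=0))
    if not diffs:
        raise ValueError("no categories left after holding one out")
    return l2_normalize(np.mean(diffs, axis=0))

def shift_embeddings(embeddings, direction, alpha):
    """Move embeddings along the direction, then renormalize.

    alpha > 0 pushes toward lookalike, alpha < 0 toward real
    (under the sign convention assumed above).
    """
    return l2_normalize(embeddings + alpha * direction)
```

Because the direction is a single vector, the same `shift_embeddings` call can in principle be applied to image embeddings or text embeddings before computing similarities; how large `alpha` should be is left open here.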

Abstract

Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual12M, and also enhances captions produced by a CLIP prefix captioner.
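The prompt-based baseline mentioned in the abstract can be illustrated as a two-way comparison of cosine similarities. The sketch below assumes the image and prompt embeddings have already been produced by CLIP's encoders; the function names are hypothetical, and the exact prompt wording used in the paper is not given in this summary, so no prompt text is assumed.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a (n, dim) and b (m, dim)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def classify_real_vs_lookalike(image_embs, real_prompt_emb, lookalike_prompt_emb):
    """Label each image by whichever paired prompt it is closer to.

    image_embs: array of shape (n, dim).
    real_prompt_emb / lookalike_prompt_emb: arrays of shape (dim,),
    the text embeddings of the paired "real" and "lookalike" prompts.
    """
    prompts = np.stack([real_prompt_emb, lookalike_prompt_emb])
    sims = cosine_similarity(image_embs, prompts)  # shape (n, 2)
    # Ties are assigned to "real" (an arbitrary choice made for this sketch).
    return np.where(sims[:, 0] >= sims[:, 1], "real", "lookalike")
```

For example, with one pair of prompt embeddings per category, the function returns an array of `"real"` / `"lookalike"` labels that can be compared against the dataset annotations to compute accuracy.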

Paper Structure

This paper contains 26 sections, 8 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Lookalike examples: top row—real objects; middle—drawings; bottom—pareidolia.
  • Figure 2: Retrieval with real prompt: Correct results in green, incorrect in red. Images ranked left to right by similarity; top five per row.
  • Figure 3: Retrieval with lookalike prompt: Correct results in green, incorrect in red. Images ranked left to right by similarity; top five per row.
  • Figure 4: Retrieval with lookalike prompts: accuracy per class before/after transformation.
  • Figure 5: Retrieval with real prompts: accuracy per class before/after transformation.
  • ...and 5 more figures