Can Modern Vision Models Understand the Difference Between an Object and a Look-alike?
Itay Cohen, Ethan Fetaya, Amir Rosenfeld
TL;DR
This work investigates whether modern vision-language models, notably CLIP, can distinguish real objects from look-alikes such as toys, statues, drawings, and pareidolia. It introduces the RoLA dataset and estimates a data-efficient real/look-alike direction $\hat{d}^{(-k)}$ in CLIP's embedding space by averaging category-wise mean differences. This direction both improves the alignment of embeddings and allows controllable shifts in model outputs. Across prompt-based classification, embedding-direction augmentation for retrieval, and a CLIP prefix captioner, the approach yields substantial gains in cross-modal discrimination and caption semantics, and it generalizes well to held-out categories. The results suggest practical benefits for tasks requiring fine-grained, human-like object discrimination, and they offer a lightweight, easily deployable mechanism for biasing the outputs of multimodal systems toward real or look-alike representations.
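As a rough illustration of "averaging category-wise mean differences", the following is a minimal sketch and not the authors' implementation. It assumes that the superscript $(-k)$ means category $k$ is left out of the estimate, that the direction points from look-alike toward real, and that the result is normalized to unit length; none of these details is stated in the summary above. The embeddings are synthetic NumPy arrays standing in for CLIP features, and the names `unit` and `real_lookalike_direction` are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale vectors to unit L2 norm along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def real_lookalike_direction(real, lookalike, held_out=None):
    """Average the per-category (mean real - mean look-alike) differences.

    real, lookalike: dicts mapping category name -> (n, dim) embedding array.
    held_out: optional category excluded from the estimate (assumed meaning
    of the (-k) superscript).
    """
    diffs = [
        real[c].mean(axis=0) - lookalike[c].mean(axis=0)
        for c in real
        if c != held_out
    ]
    return unit(np.mean(diffs, axis=0))

# Synthetic stand-ins for CLIP embeddings (assumption: no real model is used).
rng = np.random.default_rng(0)
dim = 16
true_dir = unit(rng.normal(size=dim))  # planted direction to recover
real, lookalike = {}, {}
for c in ["cat", "dog", "car"]:
    center = rng.normal(size=dim)  # category-specific content
    real[c] = center + 2.0 * true_dir + 0.1 * rng.normal(size=(20, dim))
    lookalike[c] = center - 2.0 * true_dir + 0.1 * rng.normal(size=(20, dim))

# Estimate the direction without using the "car" category.
d_hat = real_lookalike_direction(real, lookalike, held_out="car")
cosine_to_planted = float(d_hat @ true_dir)
```

Because the category-specific content cancels inside each per-category difference, only the shared real versus look-alike offset survives the average, which is one plausible reason the estimate would need few examples.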
Abstract
Recent advances in computer vision have yielded models with strong performance on recognition benchmarks; however, significant gaps remain in comparison to human perception. One subtle ability is to judge whether an image looks like a given object without being an instance of that object. We study whether vision-language models such as CLIP capture this distinction. We curate a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars (e.g., toys, statues, drawings, pareidolia) across multiple categories, and first evaluate a prompt-based baseline with paired "real"/"lookalike" prompts. We then estimate a direction in CLIP's embedding space that moves representations between real and lookalike. Applying this direction to image and text embeddings improves discrimination in cross-modal retrieval on Conceptual 12M, and also enhances captions produced by a CLIP prefix captioner.
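The two mechanisms named in the abstract, a paired-prompt baseline and a shift along an estimated direction, can be sketched on toy vectors as follows. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the decision rule (pick the prompt with the higher cosine similarity), the additive shift with re-normalization, and the step size `alpha` are all assumptions, and `prompt_baseline` and `shift` are hypothetical helper names. The vectors are synthetic stand-ins for CLIP image and text embeddings.

```python
import numpy as np

def unit(v):
    """Scale vectors to unit L2 norm along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def prompt_baseline(image_emb, real_prompt_emb, lookalike_prompt_emb):
    """Label an image by whichever of the paired prompts is closer in cosine."""
    img = unit(image_emb)
    s_real = float(img @ unit(real_prompt_emb))
    s_look = float(img @ unit(lookalike_prompt_emb))
    return "real" if s_real >= s_look else "lookalike"

def shift(emb, direction, alpha):
    """Move an embedding along the direction, then re-normalize (assumed form)."""
    return unit(emb + alpha * unit(direction))

# Toy setup: one concept, one real image, one look-alike image.
rng = np.random.default_rng(1)
dim = 16
d = unit(rng.normal(size=dim))        # stand-in for the estimated direction
concept = unit(rng.normal(size=dim))  # stand-in for a plain text query
real_img = unit(concept + 0.5 * d)
look_img = unit(concept - 0.5 * d)
gallery = np.stack([real_img, look_img])

# Retrieval scores for the query shifted toward real and toward look-alike.
scores_real = gallery @ shift(concept, d, alpha=0.5)
scores_look = gallery @ shift(concept, d, alpha=-0.5)

# Paired prompts are emulated here by the two shifted queries.
label = prompt_baseline(real_img, shift(concept, d, 0.5), shift(concept, d, -0.5))
```

In this toy setting a positive `alpha` ranks the real image first and a negative `alpha` ranks the look-alike first, which mirrors the controllable behavior the abstract describes without reproducing any reported result.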
