Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar, Sreyan Ghosh, S Sakshi, Utkarsh Tyagi, Dinesh Manocha
TL;DR
This work investigates whether open-vocabulary vision-language models (VLMs) understand compound nouns (CNs). It introduces Compun, a benchmark of 400 CNs in which each CN image is paired with distractor images that depict the CN's constituent nouns, framing CN interpretation as zero-shot text-to-image retrieval. To improve CN understanding, the authors propose retrieval with example captions: an LLM generates 5 diverse captions containing the CN, which are used to form prompts for image retrieval, and the image with the highest mean similarity to these prompts is selected. Experiments show that this caption-based prompting improves CLIP by 8.25% on Compun and also helps OpenCLIP (≈2.35%), indicating that language-driven context can mitigate CN interpretation challenges in VLMs. The study also reveals CLIP’s limited semantic grounding for attributed CNs and offers a concrete, scalable way to improve CN comprehension in retrieval tasks.
Abstract
Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting CNs. The Compun benchmark challenges a VLM on text-to-image retrieval where, given a text prompt with a CN, the task is to select the correct image that shows the CN from a set that also contains a pair of distractor images showing the constituent nouns that make up the CN. Next, we perform an in-depth analysis to highlight CLIP's limited understanding of certain types of CNs. Finally, we present an alternative framework that moves beyond the hand-written templates for text prompts widely used by CLIP-like models. We employ a Large Language Model to generate multiple diverse captions that include the CN as an object in the scene described by the caption. Our proposed method improves CN understanding of CLIP by 8.25% on Compun. Code and benchmark are available at: https://github.com/sonalkum/Compun
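The selection rule described above (score each candidate image by its mean similarity to several caption-based prompts, then pick the highest-scoring image) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and the small hand-written vectors stand in for embeddings that would, in practice, come from the text and image encoders of a CLIP-like model.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_by_mean_similarity(caption_embs, image_embs):
    """Pick the image with the highest mean similarity to all captions.

    caption_embs: (num_captions, dim) embeddings of the caption-based prompts.
    image_embs:   (num_images, dim) embeddings of the candidate images.
    Returns (index of selected image, per-image mean similarity scores).
    """
    text = l2_normalize(caption_embs)
    images = l2_normalize(image_embs)
    sims = images @ text.T            # shape: (num_images, num_captions)
    mean_scores = sims.mean(axis=1)   # average over the captions
    return int(np.argmax(mean_scores)), mean_scores


# Toy stand-ins (assumption): image 0 plays the role of the CN image,
# images 1 and 2 play the role of the two constituent-noun distractors.
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Five toy caption embeddings, mirroring the 5 LLM-generated captions.
caption_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.7, 0.3, 0.1],
    [0.9, 0.2, 0.1],
    [0.6, 0.2, 0.3],
])

best, scores = retrieve_by_mean_similarity(caption_embs, image_embs)
print("selected image:", best)
print("mean similarities:", scores)
```

Averaging over several captions means that a single caption that happens to sit close to a distractor has limited influence on the final choice, which is one plausible reading of why diverse captions help here.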
