Do Vision-Language Models Understand Compound Nouns?

Sonal Kumar, Sreyan Ghosh, S Sakshi, Utkarsh Tyagi, Dinesh Manocha

TL;DR

This work investigates whether open-vocabulary vision-language models (VLMs) understand compound nouns (CNs). It introduces Compun, a benchmark of 400 CNs with corresponding CN images and distractors that depict the CN constituents, framing CN interpretation as zero-shot text-to-image retrieval. To improve CN understanding, the authors propose retrieval with example captions: an LLM generates 5 diverse captions containing the CN, which are used to form prompts for image retrieval, and the image with the highest mean similarity to these prompts is selected. Experiments show that this caption-based prompting significantly boosts CLIP performance (≈8.25%) and also helps OpenCLIP (≈2.35%), highlighting that language-driven context can mitigate CN interpretation challenges in VLMs. The study reveals CLIP’s limited semantic grounding for attributed CNs and provides a concrete, scalable approach to enhance CN comprehension in retrieval tasks.
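The selection rule described above (score each candidate image by its mean similarity to the caption-based prompts, then pick the highest) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the prompt and image embeddings have already been computed by some VLM encoder, and the names `cosine`, `mean_similarity`, and `retrieve_with_captions`, as well as the toy 2-D vectors, are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(image_emb: Vector, prompt_embs: Sequence[Vector]) -> float:
    """Average similarity of one image to every caption-based prompt."""
    return sum(cosine(image_emb, p) for p in prompt_embs) / len(prompt_embs)


def retrieve_with_captions(
    image_embs: Sequence[Vector], prompt_embs: Sequence[Vector]
) -> int:
    """Return the index of the image with the highest mean similarity."""
    scores = [mean_similarity(img, prompt_embs) for img in image_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (assumed, for illustration only): three prompts built from
# generated captions, and three candidate images (one positive, two distractors).
prompt_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
image_embs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

best = retrieve_with_captions(image_embs, prompt_embs)
print("selected image index:", best)
```

Averaging over several prompts means a single unusual caption cannot decide the retrieval on its own, which is presumably why the mean is used rather than the maximum.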

Abstract

Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting CNs. The Compun benchmark challenges a VLM for text-to-image retrieval where, given a text prompt with a CN, the task is to select the correct image that shows the CN among a pair of distractor images that show the constituent nouns that make up the CN. Next, we perform an in-depth analysis to highlight CLIP's limited understanding of certain types of CNs. Finally, we present an alternative framework that moves beyond hand-written templates for text prompts widely used by CLIP-like models. We employ a Large Language Model to generate multiple diverse captions that include the CN as an object in the scene described by the caption. Our proposed method improves CN understanding of CLIP by 8.25% on Compun. Code and benchmark are available at: https://github.com/sonalkum/Compun
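The benchmark task described in the abstract reduces to a three-way choice per CN: one positive image against two distractors. A hedged sketch of how accuracy could be computed from precomputed similarity scores follows. The record layout, the function name `compun_accuracy`, the example scores, and the choice to count ties as errors are all assumptions made for illustration, not details taken from the paper or its repository.

```python
from typing import Dict, List, Union

Instance = Dict[str, Union[float, List[float]]]


def compun_accuracy(instances: List[Instance]) -> float:
    """Fraction of instances where the positive image outscores both distractors.

    Each instance holds the prompt-to-image similarity for the positive image
    and for the two distractor images. Ties count as errors (an assumption).
    """
    correct = 0
    for inst in instances:
        if inst["positive"] > max(inst["distractors"]):
            correct += 1
    return correct / len(instances)


# Made-up similarity scores, for illustration only.
toy_instances = [
    {"positive": 0.31, "distractors": [0.22, 0.27]},  # retrieved correctly
    {"positive": 0.24, "distractors": [0.29, 0.18]},  # a distractor wins
    {"positive": 0.28, "distractors": [0.20, 0.21]},  # retrieved correctly
    {"positive": 0.26, "distractors": [0.26, 0.19]},  # tie, counted as an error
]

accuracy = compun_accuracy(toy_instances)
print(f"accuracy: {accuracy:.2f}")
```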

Paper Structure (15 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of our proposed Retrieval with Captions. We first generate 5 diverse captions describing 5 diverse scenes, with the compound noun as an object in it. These captions are then used to build 5 custom text prompts for text-to-image retrieval, and the image with the highest mean similarity to all 5 prompts is then selected for retrieval.
  • Figure 2: Illustration of the 3 types of CNs used in our study: Either Noun, Both Nouns, and None. A brief explanation of the 3 types is provided in the analysis section. (1, left) An example of Either Noun, where an earring looks like an ordinary ring but not like an ear; the noun ear acts only as an attribute that modifies the appearance of a ring into an earring. (1, right) An example of Either Noun, where a coffee grain looks like an ordinary grain but is modified by the noun coffee, which acts as an attribute. (2) An example of None, where a cricket bat looks completely different from both a cricket and a bat. (3) An example of Both Nouns, where a snow ball looks like both snow and a ball.
  • Figure 3: Count of instances misclassified by CLIP on Compun for three settings: either, both, and none. The analysis section describes these settings. CLIP is more likely to retrieve a negative when the positive image shows either constituent noun, highlighting CLIP's limited understanding of attributed CNs.
  • Figure 4: Average CLIP similarity scores for correct predictions on Compun in three settings: either, both, and none. The analysis section describes these settings. High scores on the Compun benchmark are superficial: CLIP often wins with low similarity scores.
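
The two quantities plotted in Figures 3 and 4 (misclassification counts per CN type, and mean similarity over correct predictions per CN type) can be derived from per-instance results along the following lines. This is only a sketch: the record fields `cn_type`, `correct`, and `similarity`, the function name `summarize_by_type`, and every number in the toy data are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_by_type(records: List[dict]) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (error counts per CN type, mean similarity of correct predictions per CN type)."""
    errors: Dict[str, int] = defaultdict(int)
    correct_scores: Dict[str, List[float]] = defaultdict(list)
    for rec in records:
        if rec["correct"]:
            correct_scores[rec["cn_type"]].append(rec["similarity"])
        else:
            errors[rec["cn_type"]] += 1
    mean_correct = {t: sum(s) / len(s) for t, s in correct_scores.items()}
    return dict(errors), mean_correct


# Invented per-instance results; "similarity" is the score of the retrieved image.
toy_records = [
    {"cn_type": "either", "correct": False, "similarity": 0.21},
    {"cn_type": "either", "correct": False, "similarity": 0.19},
    {"cn_type": "either", "correct": True, "similarity": 0.25},
    {"cn_type": "both", "correct": True, "similarity": 0.30},
    {"cn_type": "both", "correct": True, "similarity": 0.32},
    {"cn_type": "none", "correct": False, "similarity": 0.20},
    {"cn_type": "none", "correct": True, "similarity": 0.28},
]

error_counts, mean_correct_similarity = summarize_by_type(toy_records)
print(error_counts)
print(mean_correct_similarity)
```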