Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations

Shamik Bhattacharya, Daniel Perkins, Yaren Dogan, Vineeth Konjeti, Sudarshan Srinivasan, Edmon Begoli

TL;DR

This work tackles lexical ambiguity through Visual Word Sense Disambiguation (VWSD), using CLIP to map ambiguous text and candidate images into a shared multimodal space. It introduces a dual-channel text prompting framework (semantic and photo prompts) combined with WordNet synonyms and a test-time image augmentation pipeline, then ranks candidate images by cosine similarity to the text embedding to select the image that matches the target word sense. Ablation studies show that prompting provides strong, low-latency gains, while aggressive image augmentation yields only marginal improvements and can increase latency; multilingual prompt ensembles and WordNet definitions can inject noise. On SemEval-2023 VWSD, the approach raises MRR from 0.7227 to 0.7590 and Hit Rate from 0.5810 to 0.6220. The result is an interpretable, efficient system with competitive performance, although a gap to state-of-the-art systems remains for future work.

Abstract

Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
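The two reported metrics can be made concrete with a short sketch. It assumes that Hit Rate means top-1 accuracy (the gold image is ranked first) and that MRR is the mean of 1/rank of the gold image across instances; the function names and the toy scores are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def gold_rank(scores: Sequence[float], gold_index: int) -> int:
    """1-based rank of the gold image when candidates are sorted by descending score.

    Ties are resolved optimistically (assumption): only candidates that score
    strictly higher than the gold image push its rank down.
    """
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def mrr_and_hit_rate(
    all_scores: Sequence[Sequence[float]], gold_indices: Sequence[int]
) -> Tuple[float, float]:
    """Mean reciprocal rank and top-1 hit rate over a set of instances."""
    ranks: List[int] = [gold_rank(s, g) for s, g in zip(all_scores, gold_indices)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hit_rate = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, hit_rate


if __name__ == "__main__":
    # Toy cosine-similarity scores for three instances with three candidates each.
    toy_scores = [
        [0.1, 0.9, 0.3],  # gold = 1 -> rank 1
        [0.5, 0.2, 0.4],  # gold = 2 -> rank 2
        [0.2, 0.3, 0.8],  # gold = 0 -> rank 3
    ]
    mrr, hit = mrr_and_hit_rate(toy_scores, [1, 2, 0])
    print(f"MRR={mrr:.4f} HitRate={hit:.4f}")
```

Under this reading, Hit Rate rewards only a correct first choice, while MRR also gives partial credit when the gold image is ranked second or third. That is why the reported MRR (0.7590) is higher than the Hit Rate (0.6220).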

Paper Structure (34 sections, 9 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Two images illustrating the ambiguity of the word "bank": one shows riverbank erosion, the other a piggy bank.
  • Figure 2: Normalization of the textual and visual inputs before they are passed into the vision-language models. The sentence "Internet Router" (with the underlined target word "router") is normalized and tokenized. Additionally, the images are resized and normalized.
  • Figure 3: The pipeline for generating contextual embeddings. In this example, the phrase "internet router" is tokenized. An initial embedding is created for the tokens of the target word (top) and the surrounding words (bottom). Each embedding is then passed into a transformer. Finally, the hidden states for the tokens of the target word are pooled together to create the contextual embedding.
  • Figure 4: Overview of the vanilla CLIP-based model. Both text and image inputs are encoded into a shared multi-modal space, and cosine similarity determines which image best aligns with the contextual meaning of the target word.
  • Figure 5: The final model. The target-sentence pair is expanded with a dual-channel prompt, which consists of semantic and photo prompts. WordNet synonyms of both context and target words are also included as additional photo prompts. All prompts are encoded by CLIP, averaged within each channel, fused with learned weights, and L2-normalized to produce robust textual embeddings. On the image side, each candidate image is transformed into a large set of stochastic multi-view augmentations. Each view is also encoded with CLIP and normalized, and the mean image embedding is computed. Cosine similarity between the final text embedding and each image embedding yields a score for ranking the images.
  • ...and 6 more figures
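The fusion and scoring steps described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are plain Python lists standing in for CLIP outputs, the channel weights `w_semantic` and `w_photo` are placeholder values (the paper's learned weights are not given here), and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse_text_channels(
    semantic: Sequence[Sequence[float]],
    photo: Sequence[Sequence[float]],
    w_semantic: float = 0.5,  # placeholder weight (assumption)
    w_photo: float = 0.5,     # placeholder weight (assumption)
) -> Vector:
    """Average prompts within each channel, fuse with weights, then L2-normalize."""
    sem = mean_vector(semantic)
    pho = mean_vector(photo)
    fused = [w_semantic * s + w_photo * p for s, p in zip(sem, pho)]
    return l2_normalize(fused)


def pool_image_views(views: Sequence[Sequence[float]]) -> Vector:
    """Normalize each augmented view's embedding, then take the mean."""
    return mean_vector([l2_normalize(v) for v in views])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def rank_images(text_emb: Sequence[float], image_embs: Sequence[Sequence[float]]) -> List[int]:
    """Candidate indices sorted from best to worst cosine match."""
    scores = [cosine(text_emb, e) for e in image_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for CLIP embeddings.
    text = fuse_text_channels(semantic=[[1.0, 0.0], [1.0, 0.0]], photo=[[0.8, 0.6]])
    candidates = [
        pool_image_views([[1.0, 0.0], [0.9, 0.1]]),  # views of candidate 0
        pool_image_views([[0.0, 1.0], [0.1, 0.9]]),  # views of candidate 1
    ]
    print(rank_images(text, candidates))
```

Because `cosine` divides by both norms, the ranking does not depend on whether the mean image embedding is re-normalized after pooling, a detail the caption leaves unspecified.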