Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

Anastasia Kritharoula; Maria Lymperaiou; Giorgos Stamou

TL;DR

This work tackles Visual Word Sense Disambiguation (VWSD), where the correct image must be retrieved from $n=10$ candidates given an ambiguous word in context. It investigates a multimodal retrieval pipeline built on Transformer-based vision-language (VL) models, augmented by Large Language Models (LLMs) used as knowledge bases to enrich the phrases being disambiguated, and complemented by Chain-of-Thought prompting for explainability. A lightweight Learn to Rank module combines features from several modules (baseline VL retrieval, LLM-enhanced phrases, captions, and web-image retrieval) to achieve competitive VWSD performance, with large LLMs (e.g., GPT-3/3.5-turbo) delivering the largest gains. The results underscore the value of combining explicit knowledge augmentation with modular fusion and explainable prompting, offering practical pathways toward robust, interpretable VWSD systems and multimodal retrieval pipelines.
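The retrieval step described above can be illustrated with a minimal sketch: candidates are ranked by cosine similarity between a phrase embedding and each image embedding. The vectors below are hand-written placeholders standing in for the output of a VL encoder, and the names `cosine` and `rank_candidates` are hypothetical; this is an illustration of similarity-based ranking in general, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(phrase_emb, image_embs):
    """Return candidate indices sorted from most to least similar, plus the raw scores."""
    scores = [cosine(phrase_emb, emb) for emb in image_embs]
    order = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Placeholder embeddings (assumption): a real system would obtain these from a VL model.
phrase_emb = [1.0, 0.0, 1.0]
image_embs = [
    [0.0, 1.0, 0.0],  # candidate 0
    [1.0, 0.0, 0.9],  # candidate 1
    [0.5, 0.5, 0.0],  # candidate 2
]

order, scores = rank_candidates(phrase_emb, image_embs)
print(order)  # candidate indices, best match first
```

In the VWSD setting the candidate list would hold 10 images rather than 3, and the top-ranked index would be taken as the prediction.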

Abstract

Visual Word Sense Disambiguation (VWSD) is a novel, challenging task whose goal is to retrieve, from a set of candidates, the image that best represents the meaning of an ambiguous word within a given context. In this paper, we take a substantial step towards understanding this task by applying a varied set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, as well as to question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. Finally, we train a learn to rank (LTR) model to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD yield insights that can guide future directions.
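To make the idea of combining modules concrete, the sketch below fuses per-module scores for each candidate with a weighted sum and ranks by the fused score. This is a deliberately simplified stand-in: the abstract does not specify the LTR model, a trained ranker would learn its parameters from data, and the module names, scores, and weights here are all hypothetical values chosen for illustration.

```python
def fuse_scores(features, weights):
    """Weighted sum of module scores for each candidate.

    features: list of dicts, one per candidate, mapping module name -> score.
    weights:  dict mapping module name -> weight (hand-set here; an LTR model would learn them).
    """
    return [sum(weights[name] * feats[name] for name in weights) for feats in features]

def rank(features, weights):
    """Return candidate indices sorted by fused score, best first."""
    fused = fuse_scores(features, weights)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)

# Hypothetical weights and per-module scores for three candidate images.
weights = {"baseline": 0.4, "llm_phrase": 0.3, "caption": 0.2, "web_image": 0.1}
features = [
    {"baseline": 0.30, "llm_phrase": 0.20, "caption": 0.10, "web_image": 0.40},
    {"baseline": 0.28, "llm_phrase": 0.35, "caption": 0.30, "web_image": 0.20},
    {"baseline": 0.10, "llm_phrase": 0.15, "caption": 0.25, "web_image": 0.10},
]

fused = fuse_scores(features, weights)
ranking = rank(features, weights)
print(ranking)  # candidate indices, best fused score first
```

In this toy example the baseline score alone would favour candidate 0, while the fused score favours candidate 1, which is the kind of correction a combination of modules is meant to allow.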

Paper Structure (36 sections, 2 equations, 9 figures, 24 tables)

Related work
Text-image retrieval
LLMs as knowledge bases
Method
1. Image-Text similarity baseline
2. LLMs for phrase enhancement
3. Image captioning for text retrieval
4. Wikipedia & Wikidata image retrieval
5. Learn to Rank
6. Question-answering for VWSD and CoT prompting
Experimental results
LLMs for phrase enhancement
Image captioning
Wikipedia & Wikidata image retrieval
Learn to rank
...and 21 more sections

Figures (9)

Figure 1: An example of the VWSD task.
Figure 2: Candidate images for the phrase "rowing dory".
Figure 3: Candidate images for the phrase "greeting card".
Figure 4: Candidate images for the phrase "suede chamois".
Figure 5: Example 1. Candidate images for the phrase "metal steel".
...and 4 more figures
