Language Models as Knowledge Bases for Visual Word Sense Disambiguation

Anastasia Kritharoula; Maria Lymperaiou; Giorgos Stamou

TL;DR

The paper addresses Visual Word Sense Disambiguation (VWSD) in two ways: it uses Large Language Models (LLMs) as Knowledge Bases to enrich the given phrases in a zero-shot fashion, and it reformulates VWSD as a textual question-answering (QA) task in which image captions serve as the answer choices. It systematically compares LLM-based phrase enhancement with QA prompting (zero-shot and few-shot, including Chain-of-Thought, CoT) across multiple visiolinguistic (VL) backbones and captioners. The findings show that knowledge-enhancement prompts (notably meaning_of) can improve retrieval performance and that model scale critically affects the success of QA prompting, with prompts such as choose-CoT offering gains in some settings. The work demonstrates both the potential and the limitations of current LLM prompting for multimodal retrieval and highlights directions for improving captioning, reasoning, and scalability in VWSD systems.

Abstract

Visual Word Sense Disambiguation (VWSD) is a novel and challenging task that lies between linguistic sense disambiguation and fine-grained multimodal retrieval. Recent advances in visiolinguistic (VL) transformers offer off-the-shelf implementations with encouraging results, which we argue can nevertheless be further improved. To this end, we propose knowledge-enhancement techniques that improve the retrieval performance of VL transformers by using Large Language Models (LLMs) as Knowledge Bases. More specifically, knowledge stored in LLMs is retrieved with appropriate prompts in a zero-shot manner, which improves performance. Moreover, we convert VWSD to a purely textual question-answering (QA) problem by treating generated image captions as multiple-choice candidate answers. Zero-shot and few-shot prompting strategies are leveraged to explore the potential of this transformation, while Chain-of-Thought (CoT) prompting in the zero-shot setting reveals the internal reasoning steps an LLM follows to select the appropriate candidate. Overall, our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve VWSD.
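
To make the two prompting ideas in the abstract concrete, the following is a minimal sketch of how the prompts might be assembled. It is not the authors' implementation. The template wording, the function names build_enhancement_prompt and build_qa_prompt, and the example phrase and captions are all assumptions made for illustration; the source names a meaning_of prompt but does not give its exact text here. The zero-shot CoT variant uses the widely known "Let's think step by step" trigger, which may differ from the wording used in the paper.

```python
from string import ascii_uppercase

# Hypothetical template: the source mentions a "meaning_of" prompt,
# but its exact wording is an assumption made for this sketch.
MEANING_OF_TEMPLATE = "What is the meaning of {phrase}?"


def build_enhancement_prompt(phrase):
    """Return a zero-shot prompt asking an LLM to explain an ambiguous phrase.

    The LLM's answer could then be attached to the phrase before it is
    passed to a VL retrieval model.
    """
    return MEANING_OF_TEMPLATE.format(phrase=phrase)


def build_qa_prompt(phrase, captions, cot=False):
    """Cast VWSD as multiple-choice QA with image captions as the choices.

    Each candidate image is represented only by its generated caption,
    so the task becomes purely textual.
    """
    if len(captions) > len(ascii_uppercase):
        raise ValueError("too many candidates to label with single letters")
    lines = [
        f'Q: Which description best matches "{phrase}"?',
        "Answer choices:",
    ]
    for label, caption in zip(ascii_uppercase, captions):
        lines.append(f"({label}) {caption}")
    # Zero-shot CoT: append a generic reasoning trigger instead of a bare "A:".
    lines.append("A: Let's think step by step." if cot else "A:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative phrase and captions only; not data from the paper.
    captions = ["a spiral galaxy in the night sky", "a shrub with white flowers"]
    print(build_enhancement_prompt("andromeda tree"))
    print(build_qa_prompt("andromeda tree", captions, cot=True))
```

In this sketch the LLM call itself is left out on purpose, since the choice of model and API is outside what the abstract specifies; only the prompt construction is shown.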
