Ukrainian Visual Word Sense Disambiguation Benchmark

Yurii Laba, Yaryna Mohytych, Ivanna Rohulia, Halyna Kyryleyza, Hanna Dydyk-Meush, Oles Dobosevych, Rostyslav Hryniv

Abstract

This study presents a benchmark for evaluating the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian. The main goal of the Visual-WSD task is to identify, with minimal contextual information, the most appropriate representation of a given ambiguous word from a set of ten images. To construct this benchmark, we followed a methodology similar to that proposed by (CITATION), who previously introduced benchmarks for the Visual-WSD task in English, Italian, and Farsi. This approach allows us to incorporate the Ukrainian benchmark into a broader framework for cross-language model performance comparisons. We collected the benchmark data semi-automatically and refined it with input from domain experts. We then assessed eight multilingual and multimodal large language models using this benchmark. All tested models performed worse than the zero-shot CLIP-based baseline model (CITATION) used by (CITATION) for the English Visual-WSD task. Our analysis revealed a significant performance gap in the Visual-WSD task between Ukrainian and English.

Paper Structure (14 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: An illustration of GPT4-Vision visual hallucination caused by an ambiguous target word.
  • Figure 2: Example of a benchmark entry. The word Коса (en: braid, transl: kosa) is ambiguous. It corresponds to the meaning Заплетене волосся; довге волосся (en: braided hair; long hair, transl: zapletene volossya; dovhe volossya). The word Волосся (en: hair, transl: volossya) is the trigger word. The image that corresponds to the intended meaning is b (underlined). The other three images are examples of negative samples. Note: While the task involves nine negative images, we only display three negative images for simplicity.
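
The entry layout described in Figure 2 can be sketched as a small data structure plus a ranking step. This is only an illustrative sketch: the class name, field names, and helper functions are hypothetical and do not come from the authors' code or data format, and the similarity scores are assumed to be produced elsewhere (for instance by some image-text model). It uses four candidate images to mirror the figure, whereas the task itself uses ten.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class VisualWSDEntry:
    """One benchmark entry. Field names are hypothetical, not the authors' schema."""
    target_word: str                    # ambiguous word, e.g. "Коса"
    trigger_word: str                   # minimal context, e.g. "Волосся"
    candidate_images: Tuple[str, ...]   # image identifiers (ten in the real task)
    gold_image: str                     # identifier of the image with the intended meaning


def rank_candidates(entry: VisualWSDEntry, scores: Sequence[float]) -> List[str]:
    """Return candidate image identifiers ordered from highest to lowest score."""
    if len(scores) != len(entry.candidate_images):
        raise ValueError("expected exactly one score per candidate image")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [entry.candidate_images[i] for i in order]


def is_top1_correct(entry: VisualWSDEntry, scores: Sequence[float]) -> bool:
    """True when the highest-scoring candidate is the gold image."""
    return rank_candidates(entry, scores)[0] == entry.gold_image


# Entry modelled on Figure 2; the scores below are made-up placeholders.
entry = VisualWSDEntry(
    target_word="Коса",
    trigger_word="Волосся",
    candidate_images=("a", "b", "c", "d"),
    gold_image="b",
)
placeholder_scores = [0.1, 0.9, 0.3, 0.2]
print(rank_candidates(entry, placeholder_scores))
print(is_top1_correct(entry, placeholder_scores))
```

Keeping the scoring step outside the entry type means the same structure could be reused with any model that assigns one score per candidate image.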