Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples
Tadej Škvorc, Marko Robnik-Šikonja
TL;DR
This work tackles data scarcity for word-sense tasks in low-resource languages by using large language models (LLMs) to expand dictionary usage examples into full sentences. It reframes word-sense disambiguation (WSD) and word-sense induction (WSI) through the data-efficient word-in-context (WiC) task and demonstrates that WiC-trained models can transfer to sense induction and disambiguation without comprehensive sense inventories. Applied to Slovene with dictionary and cross-lingual resources, the approach yields high WSD accuracy (roughly 93-94%) and competitive WSI performance, outperforming baselines that rely on short dictionary snippets. The study highlights the practical potential of combining dictionaries with LLMs for sense-aware NLP in under-resourced languages and suggests that the approach could scale to other languages and larger datasets.
Abstract
Many less-resourced languages lack the large, task-specific datasets required for solving relevant tasks with modern transformer-based large language models (LLMs). At the same time, many linguistic resources, such as dictionaries, are rarely used in this context despite their rich information content. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task, where, given a pair of sentences and a target word, a classification model predicts whether the sense of the target word differs between the two sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that it does not require pre-constructed sense inventories with a sufficient number of examples for each sense. Such inventories are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on the WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available but word-sense resources are very small.
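The abstract does not spell out how a WiC model is adapted to WSD and WSI, so the following is only a minimal sketch of one plausible scheme, not the authors' method. It assumes a hypothetical scorer that returns the probability that the target word has the same sense in two sentences. For WSD, the sense whose dictionary examples score highest against the input sentence is chosen; for WSI, sentences are grouped by linking every pair the scorer judges to share a sense. The names `toy_same_sense_prob`, `disambiguate`, `induce_senses`, and the `threshold` parameter are illustrative, and the word-overlap scorer is merely a stand-in for a trained WiC classifier.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical interface: P(target has the same sense in both sentences).
Scorer = Callable[[str, str, str], float]


def toy_same_sense_prob(sent_a: str, sent_b: str, target: str) -> float:
    """Stand-in for a trained WiC model: Jaccard overlap of context words."""
    ctx_a = set(sent_a.lower().split()) - {target}
    ctx_b = set(sent_b.lower().split()) - {target}
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


def disambiguate(sentence: str, target: str,
                 sense_examples: Dict[str, List[str]], scorer: Scorer) -> str:
    """WSD via WiC: pick the sense whose examples best match the sentence."""
    def mean_score(examples: List[str]) -> float:
        return sum(scorer(sentence, ex, target) for ex in examples) / len(examples)
    return max(sense_examples, key=lambda sense: mean_score(sense_examples[sense]))


def induce_senses(sentences: List[str], target: str, scorer: Scorer,
                  threshold: float = 0.5) -> List[List[int]]:
    """WSI via WiC: connected components over pairs judged to share a sense."""
    parent = list(range(len(sentences)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sentences)), 2):
        if scorer(sentences[i], sentences[j], target) >= threshold:
            parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(sentences)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

In this sketch the sense inventory is needed only at WSD time, as a handful of dictionary examples per sense, while WSI needs no inventory at all. Replacing the toy scorer with a real WiC classifier would leave both functions unchanged.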
