Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples

Tadej Škvorc, Marko Robnik-Šikonja

TL;DR

This work tackles data scarcity for word-sense tasks in less-resourced languages by using large language models to expand dictionary usage examples into full sentences. It reframes WSD and WSI through the data-efficient WiC task and demonstrates that WiC-trained models can transfer to sense induction and disambiguation without comprehensive sense inventories. Applied to Slovene with dictionary and cross-lingual resources, the approach yields high WSD accuracy (roughly 93-94%) and competitive WSI performance, outperforming baselines that rely on short dictionary snippets. The study highlights the practical potential of combining dictionaries with LLMs for sense-aware NLP in less-resourced languages and suggests that the approach can scale to other languages and larger datasets.

Abstract

Many less-resourced languages lack the large, task-specific datasets required for solving relevant tasks with modern transformer-based large language models (LLMs). At the same time, many linguistic resources, such as dictionaries, are rarely used in this context despite the wealth of information they contain. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task where, given a pair of sentences and a target word, a classification model is tasked with predicting whether the sense of the target word differs between the two sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that the WiC task does not need pre-constructed sense inventories with a sufficient number of examples for each sense, which are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available, but word-sense resources are tiny.

Paper Structure

This paper contains 17 sections, 1 figure, and 8 tables.

Figures (1)

  • Figure 1: A flowchart of our methodology for using WiC to solve WSD and WSI. Dictionary definitions and usage snippets can be extended into a WiC dataset, which is used to train a same-sense classifier. This WiC classifier can be applied to each candidate sense of a word to solve the WSD task. If no sense is matched, the usage is a candidate for a new sense, which addresses the WSI task. For less-resourced languages, the WSD dataset can be replaced with examples generated from a dictionary.
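
The decision logic described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_same_sense_score` is a hypothetical word-overlap stand-in for a trained WiC classifier, and the function names, the English "bank" examples, and the 0.3 threshold are all assumptions chosen only to make the example runnable.

```python
from typing import Callable, Dict, List, Optional

# A scorer takes (target word, sentence A, sentence B) and returns a
# same-sense score in [0, 1]. In practice this would be a trained WiC model.
Scorer = Callable[[str, str, str], float]


def toy_same_sense_score(target: str, sent_a: str, sent_b: str) -> float:
    """Hypothetical stand-in for a WiC classifier.

    Returns the Jaccard overlap of the context words of the two sentences
    (the target word itself is excluded). A real system would use a
    fine-tuned transformer here instead.
    """
    ctx_a = set(sent_a.lower().split()) - {target}
    ctx_b = set(sent_b.lower().split()) - {target}
    if not ctx_a or not ctx_b:
        return 0.0
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)


def disambiguate(
    target: str,
    sentence: str,
    sense_examples: Dict[str, List[str]],
    score_fn: Scorer,
    threshold: float = 0.3,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Return the best-matching sense id, or None if no sense matches.

    WSD: compare the sentence with the example sentences of every known
    sense and keep the sense with the highest same-sense score.
    WSI: a None result marks the usage as a candidate for a new sense.
    """
    best_sense: Optional[str] = None
    best_score = 0.0
    for sense_id, examples in sense_examples.items():
        if not examples:
            continue
        score = max(score_fn(target, sentence, ex) for ex in examples)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense if best_score >= threshold else None


# Toy sense inventory: sense id -> example sentences (e.g. from a dictionary).
SENSES = {
    "bank_finance": [
        "she deposited money at the bank",
        "the bank approved the loan",
    ],
    "bank_river": [
        "they sat on the bank of the river",
        "the river bank was muddy",
    ],
}

finance = disambiguate("bank", "he withdrew money at the bank",
                       SENSES, toy_same_sense_score)
river = disambiguate("bank", "the river bank flooded yesterday",
                     SENSES, toy_same_sense_score)
unseen = disambiguate("bank", "pilots bank sharply during turns",
                      SENSES, toy_same_sense_score)
```

In this sketch, the third sentence shares no context words with any stored example, so `disambiguate` returns `None` and the usage would be flagged as a possible new sense. Swapping `toy_same_sense_score` for a trained WiC model leaves the surrounding logic unchanged.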