
Presence or Absence: Are Unknown Word Usages in Dictionaries?

Xianghe Ma, Dominik Schlechtweg, Wei Zhao

TL;DR

This work bridges lexical semantic change detection and lexicography by aligning diachronic word usages to dictionary senses and generating dictionary-like definitions for novel senses. It proposes an unsupervised, graph-based clustering approach for Subtask 1 and uses prompt-based, training-free LLMs for Subtask 2, both evaluated on AXOLOTL-24 for Finnish, Russian, and German. The graph-based method outperforms the baseline in Subtask 1, and LLM-driven definition generation can outperform supervised baselines in Subtask 2, with differences across languages. The study highlights the practical potential for updating dictionaries with novel senses and notes limitations arising from small datasets, encoder choices, and data contamination concerns, pointing to larger resources and safer evaluation as directions for future work.

Abstract

There has been a surge of interest in computational modeling of semantic change. Previous work has focused on detecting and interpreting word senses gained over time; however, it remains unclear whether these gained senses are covered by dictionaries. In this work, we aim to fill this research gap by comparing detected word senses with dictionary sense inventories, bridging the communities of lexical semantic change detection and lexicography. We evaluate our system in the AXOLOTL-24 shared task for Finnish, Russian, and German (Fedorova et al., 2024). Our system is fully unsupervised. For Subtask 1, it uses a graph-based clustering approach to predict mappings between unknown word usages and dictionary entries; for Subtask 2, it generates dictionary-like definitions for the novel word usages using state-of-the-art large language models such as GPT-4 and LLaMA-3. In Subtask 1, our system outperforms the baseline system by a large margin, and its graph-based clustering makes the mapping results interpretable by distinguishing between matched and unmatched (novel) word usages. In Subtask 2, our system ranks first in Finnish and German and second in Russian on the test-phase leaderboard. These results show the potential of our system for managing dictionary entries, particularly for updating dictionaries to include novel sense entries. Our code and data are publicly available at https://github.com/xiaohemaikoo/axolotl24-ABDN-NLP.
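
To make the Subtask 1 setting concrete, the sketch below shows the input and output shape of the mapping problem: each unknown usage is either matched to an existing dictionary sense or flagged as novel. It is not the graph-based clustering method of the paper. It swaps in a much simpler nearest-sense assignment with a cosine-similarity threshold, and the function names, the toy vectors, and the threshold value of 0.5 are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_usages_to_senses(usage_vecs, sense_vecs, threshold=0.5):
    """Assign each usage to its most similar dictionary sense.

    A usage whose best similarity falls below `threshold` is mapped to None,
    meaning "no dictionary sense matches" (a candidate novel sense).
    This is a simplified stand-in, not the paper's graph-based clustering.
    """
    mapping = {}
    for usage_id, usage_vec in usage_vecs.items():
        best_sense, best_sim = None, -1.0
        for sense_id, sense_vec in sense_vecs.items():
            sim = cosine(usage_vec, sense_vec)
            if sim > best_sim:
                best_sense, best_sim = sense_id, sim
        mapping[usage_id] = best_sense if best_sim >= threshold else None
    return mapping


# Toy, hand-made 2-d vectors (hypothetical; real systems would use encoder embeddings).
sense_vecs = {"metal": [1.0, 0.0]}
usage_vecs = {"u1": [0.9, 0.1], "u2": [0.0, 1.0]}
result = map_usages_to_senses(usage_vecs, sense_vecs)
```

Here `u1` lies close to the single dictionary sense and is matched to it, while `u2` is orthogonal to it and is left unmatched. In the Subtask 2 setting, unmatched usages of this kind are the ones that would need a newly generated definition.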

Paper Structure (33 sections, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: An illustration of the workflow for the two AXOLOTL-24 subtasks. Unknown word usages refer to usages found at a later time period, and their mappings with dictionary sense entries are unknown.
  • Figure 2: A running example for the target word 'palaus' from the Finnish test set. The first two usages (before 1700) belong to the earlier time period while the last one belongs to the later.
  • Figure 3: An illustration of our semantic graph for the Finnish target word 'kupari' (root node in the graph), together with two subtrees separating two meaning clusters. One cluster represents the meaning related to a metal (in black), which is covered by dictionaries, while the other represents the novel meaning 'the recipient of metals as currency' (in blue), which is not. Each cluster contains the 4 nearest neighboring words, together with their corpus usage IDs, to interpret the collective meaning of the cluster.
  • Figure 4: A well-generated definition in Russian.
  • Figure 5: A poorly generated definition in Russian.
  • ...and 1 more figure