Generating bilingual example sentences with large language models as lexicography assistants

Raphael Merx, Ekaterina Vylomova, Kemal Kurniawan

TL;DR

In-context learning can successfully align LLMs with individual annotator preferences when generating bilingual dictionary examples. For automated rating of examples with pre-trained language models, sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages.

Abstract

We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
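To make the perplexity-based rating idea concrete, here is a minimal sketch of how sentence perplexity is conventionally computed from per-token log-probabilities. It is not taken from the paper: the function name `sentence_perplexity` and the example log-probabilities are hypothetical, and the model and tokenization used by the authors are not specified in this summary. In practice the log-probabilities would come from a pre-trained language model.

```python
import math
from typing import Sequence


def sentence_perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative log-likelihood over the tokens.

    `token_logprobs` holds natural-log probabilities, one per token,
    as assigned by some language model (assumed, not specified here).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical values, made up for illustration only.
typical = [math.log(0.5), math.log(0.25), math.log(0.5)]
unusual = [math.log(0.05), math.log(0.01), math.log(0.1)]

# Lower perplexity means the model finds the sentence more predictable.
print(sentence_perplexity(typical) < sentence_perplexity(unusual))
```

Under this reading, a candidate example with lower perplexity would be treated as more typical and more intelligible, which is the proxy relationship the abstract reports for higher-resourced languages.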

Paper Structure

This paper contains 37 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Overview of our process for generating example sentence pairs using LLMs.
  • Figure 2: Rating distributions (GPT-4o and Llama 3.1 combined) for GDEX criteria and translation correctness.