Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Minh Duc Bui; Manuel Mager; Peter Herbert Kann; Katharina von der Wense

Abstract

Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival of languages and dialects, but so far no NLP research has looked at Meenzerisch. This work presents the first NLP research explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy, and the best word generation model reaches only 1.51%. We then conduct two additional experiments to test whether accuracy improves with few-shot learning and with rules extracted from the training set and passed to the LLM. While those approaches improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

Paper Structure (56 sections, 9 figures, 7 tables)

Introduction
Contributions
Related Work
German Dialect Datasets
NLP for German Dialects
Dataset: A Dictionary for the Dialect of Mainz
"Meenzerisch": A German Dialect
Linguistic Classification and Distribution
Historical Influences
Sociolinguistic Situation and Cultural Role
Dataset Creation
Scanning and OCR
Manual Clean-Up of OCR Output
Automatic Extraction using LLMs
Quality Control
...and 41 more sections

Figures (9)

Figure 1: Dataset Creation Pipeline. Overview of the semi-automatic five-step process used to create the Mainz Dialect Dataset.
Figure 2: Part of our File with Automatically Extracted Rules. The final rules are fed into Llama-3.3 70B for definition and word generation.
Figure 3: Prompt used for extracting dictionary definitions. English translation: "You receive an unstructured or faulty definition of the word '{Word}' from an old dictionary. Your task is to extract only the actual meaning of the word from the text, without comments, reformulations, or explanations. Rules: (1) If multiple meanings are present, number them consecutively (1., 2., 3., …) and separate them by line breaks. (2) If the definition refers exclusively to another word, output '[SEE] <word>'. (3) Do not alter the original text; output only the relevant excerpt. (4) If there is no definition in the text, output 'No definition'."
Figure 4: Prompt used for cleaning dictionary definitions. English translation: "You are a careful linguistic assistant. Your task is to clean dictionary definitions without changing their meaning. Remove only unnecessary special characters such as hyphens, double spaces, or similar OCR artifacts. Leave references in the form [SEE] unchanged and preserve numbering of definitions. Here is the definition for the word '{Word}': {cleaned}. Clean the definition according to the instructions. If it is already clean, return it unchanged. Output only the cleaned definition."
Figure 5: Prompt used for generating dictionary definitions. English translation: "You are a precise and reliable linguistic assistant. Your task is to create dictionary definitions. Provide exclusively a single, short, and concise meaning of the requested word. Use the meaning commonly used in the Mainzer dialect, but formulate the definition in Standard German. Create exactly one short dictionary definition for the word '{Word}'. Use the meaning commonly used in the Mainzer dialect, without providing additional information, examples, or alternative meanings."
...and 4 more figures
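
The output format requested by the Figure 3 prompt (consecutively numbered meanings, `[SEE]` cross-references, and a no-definition sentinel) lends itself to a small parser. The following is a minimal sketch of how such outputs might be turned into structured records, not the authors' actual post-processing code. The names `ParsedEntry` and `parse_extraction` are hypothetical, the marker strings are taken from the English translation of the prompt (the original German prompt may use different literals), and the example strings are placeholders rather than entries from the dictionary.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed literals, copied from the English translation of the Figure 3 prompt.
# The original (German) prompt may use different strings.
SEE_PREFIX = "[SEE]"
NO_DEFINITION = "No definition"

# Matches a line such as "1. some meaning" and captures the text after the number.
_NUMBERED = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")


@dataclass
class ParsedEntry:
    """Hypothetical container for one parsed extraction output."""
    meanings: List[str] = field(default_factory=list)
    see: Optional[str] = None  # target word of a '[SEE] <word>' cross-reference


def parse_extraction(output: str) -> ParsedEntry:
    """Parse one model output that follows rules (1), (2), and (4) of the prompt."""
    text = output.strip()

    # Rule (4): no definition present, so return an empty entry.
    if not text or text == NO_DEFINITION:
        return ParsedEntry()

    # Rule (2): the definition only refers to another word.
    if text.startswith(SEE_PREFIX):
        return ParsedEntry(see=text[len(SEE_PREFIX):].strip())

    # Rule (1): one meaning per line, optionally numbered "1.", "2.", ...
    meanings = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _NUMBERED.match(line)
        meanings.append(match.group(1) if match else line)
    return ParsedEntry(meanings=meanings)


# Placeholder inputs for illustration only (not real dictionary entries).
numbered = parse_extraction("1. erste Bedeutung\n2. zweite Bedeutung")
reference = parse_extraction("[SEE] Beispielwort")
empty = parse_extraction("No definition")
```

Keeping cross-references in a separate `see` field, instead of treating them as meanings, would let a later step resolve them against the referenced entry or drop them from evaluation; which of the two is appropriate depends on choices the text above does not specify.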
