Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

Renhao Pei; Siyao Peng; Verena Blaschke; Robert Litschko; Barbara Plank

Abstract

Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. The asymmetries are especially large for local language varieties: local Wikipedia editions, for example, contain information that is absent from the edition in the standard variety. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on local Wikipedia pages but absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information found only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations, together with stratified evaluations, reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about the inclusivity and cultural coverage of LLMs.

Paper Structure (35 sections, 1 figure, 10 tables)

Introduction
Related Work
WiLoVa-QA Dataset
Target Language Varieties
Preprocessing Wikipedia Pages
Aligning Wikipedia pages
Local-heavy filtering
Removing direct translations and structural mismatches
Question-Answering Annotation
Lead section QA annotations
Verifying information asymmetry in the full document
Document-level QA annotations
Annotators
Quality Control
Experiments
...and 20 more sections

Figures (1)

Figure 1: QA performance measured by BERTScore on the ECLeKTic dataset. Results compare context types of question-only, +context, and +context (translated) across models.
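
The three context types named in the Figure 1 caption can be illustrated as prompt variants. The following is a minimal sketch, not the authors' setup: the function names, the prompt template, and the example strings are all assumptions made for illustration, and the translated context is taken as given rather than produced by any particular translation system.

```python
from typing import Dict, Optional


def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Assemble a QA prompt. The template wording is hypothetical."""
    parts = []
    if context:
        # Prepend the supporting passage only when one is supplied.
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


def build_conditions(
    question: str, context: str, translated_context: str
) -> Dict[str, str]:
    """Return one prompt per context type listed in the Figure 1 caption."""
    return {
        "question-only": build_prompt(question),
        "+context": build_prompt(question, context),
        "+context (translated)": build_prompt(question, translated_context),
    }


if __name__ == "__main__":
    # Placeholder strings, not items from the dataset.
    prompts = build_conditions(
        question="<question>",
        context="<lead section in the local variety>",
        translated_context="<lead section translated to the standard variety>",
    )
    for name, prompt in prompts.items():
        print(f"--- {name} ---\n{prompt}\n")
```

Each prompt would then be sent to a model and the generated answer scored against the reference answer, for example with BERTScore as in the caption.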
