OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context

Steffen Kleinle, Jakob Prange, Annemarie Friedrich

TL;DR

This work introduces OMoS-QA, a German-English extractive QA corpus tailored to online migration counseling, built from multilingual municipal documents and supplemented by crowd-sourced answer annotations. Questions are auto-generated with Mixtral and then manually filtered and annotated to ensure high agreement and accurate sentence-level evidence, including explicit handling of unanswerable questions. The authors benchmark five open-weight LLMs and a DeBERTa-based classifier on the dataset and find consistently high precision with low-to-moderate recall; a cross-language pilot shows partial robustness when question and document languages differ. The dataset and findings support faithful, multilingual QA for socio-political domains and offer a foundation for practical online counseling tools, with potential for expansion to further languages.

Abstract

When immigrating to a new country, it is easy to feel overwhelmed by the need to obtain information on financial support, housing, schooling, language courses, and other issues. If relocation is rushed or even forced, the necessity for high-quality answers to such questions is all the more urgent. Official immigration counselors are usually overbooked, and online systems could guide newcomers to the requested information or a suitable counseling service. To this end, we present OMoS-QA, a dataset of German and English questions paired with relevant trustworthy documents and manually annotated answers, specifically tailored to this scenario. Questions are automatically generated with an open-source large language model (LLM) and answer sentences are selected by crowd workers with high agreement. With our data, we conduct a comparison of 5 pretrained LLMs on the task of extractive question answering (QA) in German and English. Across all models and both languages, we find high precision and low-to-mid recall in selecting answer sentences, which is a favorable trade-off to avoid misleading users. This performance even holds up when the question language does not match the document language. When it comes to identifying unanswerable questions given a context, there are larger differences between the two languages.
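
The precision and recall mentioned above are computed over selected answer sentences. A minimal sketch of that computation for a single question follows, assuming sentences are identified by their index in the document; the function name and the convention of returning 0.0 for an empty set are illustrative assumptions, not taken from the paper.

```python
def precision_recall(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Sentence-level precision and recall for one question.

    predicted: indices of sentences the system selected as answers.
    gold:      indices of sentences in the gold standard.
    Returning 0.0 when a set is empty is an assumed convention.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# A cautious system that selects one correct sentence out of three gold ones:
p, r = precision_recall({4}, {4, 5, 6})
print(p, r)
```

In this example every selected sentence is correct but two gold sentences are missed, which is the high-precision, lower-recall pattern the abstract describes as the safer trade-off for users.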

Paper Structure (29 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of our proposed task, system, and new dataset, OMoS-QA: After the user asks a question, the system retrieves relevant documents and extracts answer sentences. The system is evaluated using the OMoS-QA corpus.
  • Figure 2: OMoS-QA dataset creation. Documents are taken from real-life multilingual knowledge bases. Questions are generated using Mixtral, but answers are annotated manually using crowdsourcing. The double-annotated dataset is then filtered at the question level according to inter-annotator agreement.
  • Figure 3: Gold standard construction from labels of two human annotators A1 (blue) and A2 (green). The gold standard contains sentences that A1 and A2 both mark as answers, as well as adjacent sentences marked by only one of them if at most three sentences away from the agreed-upon answer.
  • Figure 4: Test set performance as a function of the number of ground-truth answer sentences (0-shot Llama3-70B on German questions and documents).
  • Figure 5: Chunked samples for 5-shot experiments.
  • ...and 1 more figure
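
The gold-standard rule described in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' implementation: it assumes annotations are sets of sentence indices, and it interprets "at most three sentences away" as an index distance of at most 3 from any sentence both annotators selected. The function name `build_gold` and the parameter `max_dist` are hypothetical.

```python
def build_gold(a1: set[int], a2: set[int], max_dist: int = 3) -> set[int]:
    """Merge two annotators' answer-sentence selections into a gold set.

    a1, a2:   sentence indices marked as answers by annotators A1 and A2.
    max_dist: assumed distance threshold (in sentences) from the caption.
    """
    # Sentences both annotators marked are always part of the gold standard.
    agreed = a1 & a2
    # Sentences marked by exactly one annotator are candidates.
    single = a1 ^ a2
    # Keep a candidate only if it lies close to an agreed-upon sentence.
    nearby = {
        i for i in single
        if any(abs(i - j) <= max_dist for j in agreed)
    }
    return agreed | nearby


# A1 marks sentences 2, 3, 4, 10; A2 marks 3 and 5. Only sentence 3 is agreed.
print(sorted(build_gold({2, 3, 4, 10}, {3, 5})))
```

Under this reading, sentences 2, 4, and 5 are kept because they lie within three sentences of the agreed sentence 3, while sentence 10 is dropped. If the annotators share no sentence, the sketch returns an empty set.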