SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Gayane Ghazaryan; Erik Arakelyan; Pasquale Minervini; Isabelle Augenstein

TL;DR

This work proposes $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages, and shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.

Abstract

Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulty of collection and manual annotation. As a result, producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some models performing close to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource languages.
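The abstract does not spell out the mining and validation criteria, so the following is only a minimal sketch of how two of the steps named elsewhere on this page (length matching for parallel paragraphs, and substring matching for translated answers) might look. The `QAItem` class, the function names, and the `max_ratio` threshold are all hypothetical choices made for illustration, not the authors' implementation; the semantic-similarity check is omitted.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical container for one translated multiple-choice question."""
    question: str
    choices: list[str]
    answer: str  # expected to be one of `choices`


def length_matched(en_paragraph: str, tgt_paragraph: str, max_ratio: float = 1.5) -> bool:
    """Keep a paragraph pair only if the whitespace-token counts are comparable.

    `max_ratio` is an assumed threshold, not a value taken from the paper.
    """
    en_len = len(en_paragraph.split())
    tgt_len = len(tgt_paragraph.split())
    if en_len == 0 or tgt_len == 0:
        return False
    return max(en_len, tgt_len) / min(en_len, tgt_len) <= max_ratio


def answer_grounded(item: QAItem, tgt_paragraph: str) -> bool:
    """Substring check: the translated answer must occur in the target paragraph."""
    return item.answer.casefold() in tgt_paragraph.casefold()


def validate(items: list[QAItem], tgt_paragraph: str) -> list[QAItem]:
    """Drop items whose answer is not among the choices or is not grounded in the paragraph."""
    return [
        item
        for item in items
        if item.answer in item.choices and answer_grounded(item, tgt_paragraph)
    ]
```

In this sketch, a translated item survives only if its answer still appears verbatim (ignoring case) in the human-curated target-language paragraph, which is one simple way a translation error in the answer could be detected without manual annotation.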

Paper Structure (23 sections, 1 equation, 6 figures, 6 tables)

Introduction
Methodology
Parallel Data Mining
QA Generation
Translation and Validation
Experimental Setup
QA Generation
Substring Matching and Semantic Similarity
Armenian QA Benchmarking
Results
English QA Dataset Generation
Dataset Diversity
Human Evaluation
Automatic Translation and Validation
Armenian QA dataset
...and 8 more sections

Figures (6)

Figure 1: The proposed framework comprises three components: (i) a module for mining parallel paragraphs using wiki-API and length matching; (ii) a module that generates a synthetic question-answering dataset with an LLM using the mined English paragraphs; (iii) a module that translates the question-answer pairs and filters/validates them to obtain a high-quality synthetic QA dataset in the low-resource language.
Figure 2: BERTopic embeddings similarity heatmap for the top 6 frequent topics in the mined English paragraphs.
Figure 3: The similarity heatmap of the top 6 frequent topics present within the mined English paragraphs.
Figure 4: The usage of frequent words in the top 6 frequent topics present within the mined English paragraphs.
Figure 5: Accuracy of each model with a varying number of in-context examples given before generation.
...and 1 more figure
