Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian
Aleksa Cvetanović, Predrag Tadić
TL;DR
This work tackles the scarcity of question answering (QA) data for Serbian by generating a large synthetic dataset, SQuAD-sr, from SQuAD v1.1 via an adapted Translate-Align-Retrieve pipeline. The authors compare monolingual and multilingual pre-trained models and find that the monolingual BERTić model fine-tuned on Latin-script data delivers the strongest results (EM 73.91, F1 82.97) on a Serbian translation of the XQuAD benchmark, surpassing zero-shot baselines but not human performance. They show that fine-tuning on Latin script outperforms fine-tuning on Cyrillic, and they analyze results by question type and dataset properties. The work supports the viability of synthetic data for low-resource languages and releases both the dataset and the best-performing model to the community, with implications for rapid QA deployment in Serbian and similar languages.
Abstract
In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset, with more than 87K samples, which we name SQuAD-sr. To account for the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. The best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines but does not surpass human performance. We note the advantage of using a monolingual pre-trained model over a multilingual one, as well as the performance increase gained by using the Latin script over Cyrillic. Through additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model in the absence of a manually crafted and annotated dataset.
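For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for SQuAD-style extractive QA. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and the normalization (lowercasing, dropping Unicode punctuation, collapsing whitespace) is a guess at a reasonable choice for Serbian. The original English SQuAD script also removes English articles, which is omitted here.

```python
import unicodedata
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop Unicode punctuation, and split on whitespace.

    Assumption: this normalization is illustrative; the abstract does not
    specify the one actually used.
    """
    lowered = text.lower()
    no_punct = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return no_punct.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: list[str], references: list[list[str]]) -> dict[str, float]:
    """Average both metrics over a dataset, as percentages.

    Each question may have several gold answers; the best score over them
    is taken for each question.
    """
    em_total = 0.0
    f1_total = 0.0
    for pred, golds in zip(predictions, references):
        em_total += max(exact_match(pred, g) for g in golds)
        f1_total += max(f1_score(pred, g) for g in golds)
    n = len(predictions)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}
```

Under this sketch, a prediction that contains the gold answer plus extra tokens scores zero on Exact Match but receives partial credit on F1, which is why F1 is typically the higher of the two numbers.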
