Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

Emil Nuutinen, Iiro Rastas, Filip Ginter

TL;DR

The paper tackles the paucity of non-English QA datasets by introducing a markup-based transfer method that leverages DeepL's ability to translate formatted documents to preserve span annotations. Applied to SQuAD2.0, this yields Finnish SQuAD2.0 with high coverage and enables training strong Finnish extractive QA models, notably FinBERT-based architectures. The approach outperforms prior Finnish MT transfer methods and shows potential applicability to other languages and tasks, albeit with reliance on a commercial MT service. Overall, the method provides a practical, scalable pathway to expand QA resources across languages, with quantified trade-offs from MT-induced errors.

Abstract

We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset and more generally the MT method through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, as well as through the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
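The idea of carrying span annotations through translation as markup can be illustrated with a minimal sketch. It is not the authors' implementation: the numbered tags `<a0>...</a0>` are a hypothetical format chosen for this example (the paper's figure shows colored answer spans, so the actual markup may well differ), and the translation step is passed in as a plain callable `translate`, so no real MT API is called or assumed. The sketch also assumes that the MT system returns the tags intact and that every span is non-empty.

```python
import re

# Hypothetical tag format for this sketch: <a0>...</a0>, <a1>...</a1>, ...
TAG = re.compile(r"<(/?)a(\d+)>")


def mark_spans(context, spans):
    """Wrap each (start, end) character span of `context` in numbered tags."""
    events = []
    for i, (start, end) in enumerate(spans):
        # Opening tags: at equal positions, the span that ends later opens first.
        events.append((start, 1, -end, f"<a{i}>"))
        # Closing tags: sorted before opening tags at the same position,
        # and the span that started later closes first (keeps nesting tidy).
        events.append((end, 0, -start, f"</a{i}>"))
    events.sort()

    out, last = [], 0
    for pos, _, _, tag in events:
        out.append(context[last:pos])
        out.append(tag)
        last = pos
    out.append(context[last:])
    return "".join(out)


def extract_spans(tagged):
    """Strip the tags and return (clean_text, {span_index: (start, end)})."""
    parts, clean_len, last = [], 0, 0
    open_starts, spans = {}, {}
    for m in TAG.finditer(tagged):
        piece = tagged[last:m.start()]
        parts.append(piece)
        clean_len += len(piece)
        last = m.end()
        idx = int(m.group(2))
        if m.group(1):  # closing tag
            if idx in open_starts:
                spans[idx] = (open_starts.pop(idx), clean_len)
        else:  # opening tag
            open_starts[idx] = clean_len
    parts.append(tagged[last:])
    return "".join(parts), spans


def transfer_spans(context, spans, translate):
    """Mark spans, run the caller-supplied `translate`, and recover the spans."""
    return extract_spans(translate(mark_spans(context, spans)))


if __name__ == "__main__":
    context = "The documents obtained by WikiLeaks were published in 2010."
    spans = [(4, 35), (26, 35)]  # two overlapping answers
    # Stand-in for an MT system: it only shifts the offsets.
    toy_translate = lambda text: text.replace("The ", "All of the ")
    text, new_spans = transfer_spans(context, spans, toy_translate)
    for idx, (start, end) in sorted(new_spans.items()):
        print(idx, repr(text[start:end]))
```

Spans whose tags do not survive translation are simply missing from the returned dictionary, which is one way a pipeline of this kind could measure how many annotations were transferred.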

Paper Structure

This paper contains 14 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Example of the colored answer spans from an actual SQuAD passage: the original English passage (top), its Finnish translation (middle), and its backtranslation from Finnish into English (bottom). This example is shown as-is without any manual corrections (other than adjusting colors for better readability). Note the two overlapping answers, "documents obtained by WikiLeaks" and "WikiLeaks", at the very beginning of the passage.