Multilingual Open QA on the MIA Shared Task
Navya Yarrabelly, Saloni Mittal, Ketan Todi, Kimihiro Hasegawa
TL;DR
The paper addresses cross-lingual information retrieval and multilingual open QA in low-resource languages without supervision. It introduces a zero-shot Question-Generation based Re-ranking (QGPR) method that re-scores passages from a multilingual dense retriever by estimating $p(q|z)$ with a pretrained multilingual LM, and separately evaluates machine-translation based data augmentation. Experiments on XOR-TYDI-QA show that QGPR yields consistent gains in cross-lingual retrieval for several languages (notably Korean and Japanese), while MT augmentation produces mixed, often limited QA improvements likely due to context-length and translation quality constraints. Overall, the approach provides a training-free enhancement that can augment existing retrieval pipelines and offers insights into language-resource effects in cross-lingual open QA.
Abstract
Cross-lingual information retrieval (CLIR)~\cite{shi2021cross, asai2021one, jiang2020cross} can find relevant text in any language, such as English (high-resource) or Telugu (low-resource), even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop CLIR models for this constrained yet important setting, in which no additional supervision or labelled data is required for the retrieval task, so that the models can work effectively for low-resource languages. \par We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot multilingual question generation model, a pre-trained language model, by computing the probability of the input question in the target language conditioned on a retrieved passage, which may be in a different language. We evaluate our method in a completely zero-shot setting; it does not require any training. The main advantage of our approach is therefore that it can be used to re-rank results obtained by any sparse retrieval method such as BM25. This eliminates the need for an expensive labelled corpus for the retrieval task, and hence the approach can be used for low-resource languages.
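The re-scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names `rerank`, `Scorer`, and `toy_log_likelihood` are hypothetical, and the scorer is a smoothed unigram model standing in for the pretrained multilingual language model, so it only works when the question and passage share a language. Averaging the log-probability over question tokens is likewise an assumption made here for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical scorer signature: (question, passage) -> estimate of log p(q | z).
Scorer = Callable[[str, str], float]


def toy_log_likelihood(question: str, passage: str) -> float:
    """Stand-in for the pretrained multilingual LM (an assumption, not the paper's model).

    Estimates an add-one smoothed unigram model from the passage and returns
    the average log-probability of the question tokens under that model.
    """
    q_tokens = question.lower().split()
    z_tokens = passage.lower().split()
    if not q_tokens:
        raise ValueError("question must contain at least one token")
    counts = Counter(z_tokens)
    vocab_size = len(set(z_tokens) | set(q_tokens))
    total = len(z_tokens) + vocab_size
    log_probs = [math.log((counts[t] + 1) / total) for t in q_tokens]
    return sum(log_probs) / len(q_tokens)


def rerank(question: str, passages: List[str], scorer: Scorer) -> List[Tuple[str, float]]:
    """Re-score each retrieved passage z by the scorer's log p(q | z), best first."""
    scored = [(z, scorer(question, z)) for z in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Passages as they might come back from a first-stage retriever (toy data).
retrieved = [
    "bananas are yellow fruit",
    "the capital of france is paris",
]
ranked = rerank("what is the capital of france", retrieved, toy_log_likelihood)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```

Because `rerank` only depends on the `Scorer` signature, the toy scorer could in principle be swapped for a function that queries a multilingual sequence-to-sequence model for the question likelihood, leaving the first-stage retriever unchanged.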
