What are the limits of cross-lingual dense passage retrieval for low-resource languages?
Jie Wu, Zhaochun Ren, Suzan Verberne
TL;DR
This work probes the limits of cross-lingual dense passage retrieval for two extremely low-resource languages, Amharic and Khmer, within the CORA framework. By post-training mBERT with Masked Language Modeling (MLM) and Translation Language Modeling (TLM) and fine-tuning via cross-lingual question–passage alignment, the study extends mDPR with enlarged vocabularies (amBERT/kmBERT) and curated sentence-aligned datasets. Language alignment sometimes improves Amharic and Khmer QA, but the gains over a strong multilingual baseline are modest and absolute performance remains low, underscoring intertwined challenges in model capacity, data quality, and evaluation. The work highlights the need for more high-quality low-resource data, better evaluation paradigms, and continued exploration of alignment strategies to realize CORA’s multilingual open QA potential for languages with scarce resources.
Abstract
In this paper, we analyze the capabilities of the multilingual Dense Passage Retriever (mDPR) for extremely low-resource languages. In the Cross-lingual Open-Retrieval Answer Generation (CORA) pipeline, mDPR achieves success on multilingual open QA benchmarks across 26 languages, of which 9 were unseen during training. These results are promising for Question Answering (QA) for low-resource languages. We focus on two extremely low-resource languages for which mDPR performs poorly: Amharic and Khmer. We collect and curate datasets to train mDPR models using Translation Language Modeling (TLM) and question–passage alignment. We also investigate the effect of our extension on the language distribution in the retrieval results. Our results on the MKQA and AmQA datasets show that language alignment brings improvements to mDPR for the low-resource languages, but the improvements are modest and the results remain low. We conclude that fulfilling CORA's promise to enable multilingual open QA in extremely low-resource settings is challenging because the model, the data, and the evaluation approach are intertwined. Hence, all three need attention in follow-up work. We release our code for reproducibility and future work: https://anonymous.4open.science/r/Question-Answering-for-Low-Resource-Languages-B13C/
