Dense Retrieval for Low Resource Languages -- the Case of Amharic Language
Tilahun Yeshambel, Moncef Garouani, Serge Molina, Josiane Mothe
TL;DR
This work investigates dense retrieval for Amharic, a low-resource language with complex morphology and limited data, by evaluating ColBERTv2 under constrained resources. It constructs Amharic-focused datasets and compares dense ColBERT variants against BM25, showing meaningful gains after fine-tuning on Amharic data (e.g., $NDCG@10$ up to $0.704$ on 2AIRTC) while highlighting substantial indexing costs and infrastructure barriers. The findings suggest that with modest training data (around 150 examples) and Amharic-directed pretraining and fine-tuning, dense retrieval can approach or surpass traditional sparse methods, albeit with practical deployment challenges in low-resource settings. The study emphasizes the need for local AI infrastructure to support advanced IR research in Amharic and similar languages, given reliance on external compute resources.
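Since the headline result is reported as $NDCG@10$, a minimal sketch of how that metric is typically computed may help in reading the numbers. It assumes the common linear-gain formulation with a $\log_2$ rank discount. The paper's exact evaluation tooling and gain function are not stated here, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative, not taken from the authors' code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top position is discounted by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance grades of the retrieved documents, in ranked order.
    all_relevances: grades of every judged document for the query (assumption:
        if omitted, the ideal ranking is built from the retrieved list only).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Example: relevant documents at ranks 1 and 3, a non-relevant one at rank 2.
score = ndcg_at_k([1, 0, 1])
print(round(score, 4))
```

A corpus-level figure such as the one quoted above would then be the mean of this per-query score over all test queries.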
Abstract
This paper reports difficulties encountered and results obtained when applying dense retrievers to Amharic, a low-resource language spoken by a population of 120 million. The efforts made and the difficulties faced by Addis Ababa University in advancing Amharic information retrieval will be discussed during the presentation.
