Dense Retrieval for Low Resource Languages -- the Case of Amharic Language

Tilahun Yeshambel, Moncef Garouani, Serge Molina, Josiane Mothe

TL;DR

This work investigates dense retrieval for Amharic, a low-resource language with complex morphology and limited data, by evaluating ColBERTv2 under constrained resources. It constructs Amharic-focused datasets and compares dense ColBERT variants against BM25, showing meaningful gains after fine-tuning on Amharic data (e.g., $NDCG@10$ up to $0.704$ on 2AIRTC) while highlighting substantial indexing costs and infrastructure barriers. The findings suggest that with modest training data (around 150 examples) and Amharic-directed pretraining and fine-tuning, dense retrieval can approach or surpass traditional sparse methods, albeit with practical deployment challenges in low-resource settings. The study emphasizes the need for local AI infrastructure to support advanced IR research in Amharic and similar languages, given reliance on external compute resources.
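For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It assumes linear gain and a log2 rank discount, a widespread convention. The source does not say which variant or evaluation tool the authors used, and the function names here are illustrative only.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades.

    Assumption: linear gain (the grade itself) with a log2 rank discount.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: grades of the retrieved documents, in ranked order.
    judged_relevances: grades of every judged document for the query,
                       used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy example: two relevant documents, retrieved at ranks 1 and 3.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
```

A system-level figure such as the one reported for 2AIRTC is typically the mean of this per-query value over all test queries.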

Abstract

This paper reports difficulties encountered and results obtained when using dense retrievers on Amharic, a low-resource language spoken by 120 million people. The efforts made and the difficulties faced by Addis Ababa University in advancing Amharic information retrieval will be discussed during the presentation.

Paper Structure

This paper contains 3 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: NDCG@10 on (a) 2AIRTC and (b) AfriCLIRMatrix. The first bars are without Amharic-specific pre-training; the last bars are with it.