Nepali Passport Question Answering: A Low-Resource Dataset for Public Service Applications

Funghang Limbu Begha, Praveen Acharya, Bal Krishna Bal

Abstract

Nepali, a low-resource language, poses significant challenges for building effective information retrieval (IR) systems because of the lack of annotated data and computational linguistic resources. In this study, we address this gap by preparing a Nepali question-answer dataset structured as pairs. We focus on Frequently Asked Questions (FAQs) about passport-related services and build a dataset for training and evaluating IR models. We fine-tune transformer-based embedding models for semantic similarity in question-answer retrieval and compare them against a BM25 baseline. In addition, we implement a hybrid retrieval approach that integrates the fine-tuned models with BM25, and we evaluate its performance. Our results show that the fine-tuned SBERT-based models outperform BM25, while the multilingual E5 embedding-based models achieve the highest retrieval performance among all evaluated models.

Paper Structure (18 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: FAQs extracted from different websites
  • Figure 2: Token count distribution across the Nepali question-answer pair dataset for training, validation, and test splits using the multilingual intfloat/e5-large tokenizer. The histogram shows the number of tokens per pair, calculated by summing the tokens of the query and its corresponding positive entry, highlighting the overall sequence length patterns in the dataset.
  • Figure 3: Workflow of the proposed information retrieval framework, which evaluates lexical (BM25), fine-tuned embedding-based, and hybrid (BM25 + intfloat/e5-base) retrieval models, followed by statistical significance testing against the BM25 baseline.
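
The hybrid retrieval idea named in the abstract and in Figure 3 can be illustrated with a minimal sketch. The fusion rule below (min-max normalization followed by a weighted sum with a hypothetical weight `alpha`) is an assumption chosen for illustration; this section does not state how the authors combine BM25 and embedding scores. The dense vectors are toy stand-ins for embeddings that a fine-tuned model would produce, and all function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for one tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Rank documents by alpha * BM25 + (1 - alpha) * cosine (assumed fusion rule)."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return ranking, fused


# Toy data (hypothetical): three tokenized FAQ entries and 2-d stand-in embeddings.
docs = [["passport", "renewal", "fee"],
        ["visa", "application", "form"],
        ["passport", "lost", "report"]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
ranking, fused = hybrid_rank(["passport", "renewal"], [1.0, 0.0], docs, doc_vecs)
print(ranking, fused)
```

Normalizing both score lists before mixing keeps the unbounded BM25 values from dominating the cosine similarities; other fusion schemes, such as reciprocal rank fusion, would be equally plausible here.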