Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks

Jinu Nyachhyon, Mridul Sharma, Prajwal Thapa, Bal Krishna Bal

TL;DR

The paper addresses the limited scope of Nepali NLP evaluation by introducing the NLUE benchmark, a GLUE/XGLUE-inspired suite with 12 Nepali NLU tasks spanning sentiment, acceptability, commonsense reasoning, paraphrase, similarity, inference, coreference, and a General Masked Evaluation Task. Datasets are created through a mix of translating English benchmarks using LLMs (e.g., GPT-4o-mini, Gemini-2.5-flash) and manual curation to ensure linguistic relevance, with careful quality control including back-translation and category-specific checks. Ten models, including monolingual Nepali and multilingual variants, are evaluated under varied finetuning configurations and hyperparameters, revealing that multilingual models generally offer stronger semantic capabilities while Nepali-specific models excel in certain single-sentence tasks, but coreference and small-task performance remain challenging. The NLUE benchmark provides a robust, diverse platform for advancing Nepali NLP, highlighting the value of cross-lingual transfer and the need for larger, more varied Nepali datasets to drive progress in low-resource language understanding.

Abstract

The Nepali language has distinct linguistic features, especially its complex script (Devanagari), morphology, and various dialects, which pose unique challenges for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts its utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark, for evaluating the performance of models across a diverse set of NLU tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and the General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.

Paper Structure

This paper contains 58 sections, 6 equations, 35 figures, 8 tables.

Figures (35)

  • Figure 1: Different training config based on parameters with initial FC Layer
  • Figure 2: Different training config based on parameters without initial FC Layer
  • Figure 3: SA Positive (1) and Negative (0) Sample
  • Figure 4: CoLA Positive (1) and Negative (0) Sample
  • Figure 5: WG Sample
  • ...and 30 more figures