Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models

Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol

TL;DR

This work addresses the lack of robust evaluation benchmarks for Dzongkha in multilingual QA by introducing DZEN, a Dzongkha-English benchmark of more than 5K questions aligned with Bhutan's national curriculum. It evaluates multiple LLMs, reveals a substantial English–Dzongkha performance gap for several of them, and shows that Chain-of-Thought prompting and translation-augmented prompts can improve Dzongkha performance, with effectiveness varying across subjects and question types. The study also shows that open-source models lag behind proprietary ones, and that translating queries into English or appending English translations can improve reasoning outcomes in Dzongkha. Together, these findings highlight opportunities to strengthen multilingual foundation models for low-resource languages and can guide future research on cross-language evaluation and prompting strategies.

Abstract

In this work, we present DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The collection contains over 5K questions that span a variety of scientific topics and include factual, application, and reasoning-based questions. We use this parallel dataset to evaluate a number of Large Language Models (LLMs) and find a significant gap between their performance in English and in Dzongkha. We also examine different prompting strategies and find that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. In addition, adding English translations improves the accuracy of responses to Dzongkha questions. Our results point to promising avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: https://github.com/kraritt/llm_dzongkha_evaluation.
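
The two prompting strategies mentioned in the abstract, CoT prompting and appending an English translation to a Dzongkha question, can be illustrated with a minimal sketch. This is not the authors' prompt template: the function name `build_prompt`, the multiple-choice layout, the option letters, and the instruction wording are all assumptions made for illustration only.

```python
from typing import Optional, Sequence

# Hypothetical instruction strings; the paper's actual wording is not given here.
COT_SUFFIX = "Let's think step by step, then give the final answer as a single option letter."
DIRECT_SUFFIX = "Answer with a single option letter."


def build_prompt(
    question: str,
    options: Sequence[str],
    translation: Optional[str] = None,
    use_cot: bool = False,
) -> str:
    """Assemble one prompt for a question (assumed here to be multiple choice).

    question    -- the question text, e.g. in Dzongkha
    options     -- answer options, labelled A, B, C, D in order
    translation -- optional English translation appended after the question
    use_cot     -- if True, ask for step-by-step reasoning before the answer
    """
    lines = [f"Question: {question}"]
    if translation is not None:
        # Translation-augmented variant: keep the original question and add English.
        lines.append(f"English translation: {translation}")
    for letter, option in zip("ABCD", options):
        lines.append(f"{letter}. {option}")
    lines.append(COT_SUFFIX if use_cot else DIRECT_SUFFIX)
    return "\n".join(lines)
```

Calling `build_prompt(q_dz, opts, translation=q_en, use_cot=True)` would combine both strategies in a single prompt, while the defaults give a plain direct-answer prompt that can serve as the baseline for comparison.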

Paper Structure

This paper contains 52 sections, 20 figures, 9 tables.

Figures (20)

  • Figure 1: Average score of GPT-3.5 Turbo on English questions in few-shot scenarios, with and without CoT reasoning. Note that w/ denotes with and w/o denotes without CoT.
  • Figure 2: Subject-by-subject CoT performance in English. Note that w/ denotes with and w/o denotes without CoT.
  • Figure 3: Performance summary in English by question type. Note that w/ denotes with and w/o denotes without CoT.
  • Figure 4: k-shot CoT prompting of GPT-3.5 in Dzongkha and English.
  • Figure 5: Impact of CoT reasoning on GPT-3.5 for DZEN English across k-shot settings. Note that w/ denotes with and w/o denotes without CoT.
  • ...and 15 more figures