BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

Mohsinul Kabir, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, M Saiful Bari, Enamul Hoque

TL;DR

BenLLM-Eval investigates the viability of zero-shot large language models for NLP in Bengali, a low-resource language. The authors assemble seven Bengali tasks across eight benchmarks and evaluate GPT-3.5, LLaMA-2-13b-chat, and Claude-2 with carefully crafted prompts and no fine-tuning. The results show that zero-shot LLMs lag behind state-of-the-art (SOTA) fine-tuned models on most tasks, although GPT-3.5 or Claude-2 approaches or exceeds SOTA on specific tasks; the open-source LLaMA-2-13b-chat generally underperforms. The paper also identifies task contamination risks, discusses limitations and ethical considerations, and proposes expanding language coverage and dataset resources in future work.

Abstract

Large Language Models (LLMs) have emerged as one of the most important breakthroughs in NLP for their impressive skills in language generation and other language-specific tasks. Though LLMs have been evaluated in various tasks, mostly in English, they have not yet undergone thorough evaluation in under-resourced languages such as Bengali (Bangla). To this end, this paper introduces BenLLM-Eval, a comprehensive evaluation of LLMs that benchmarks their performance in Bengali, a language with modest resources. We select important and diverse Bengali NLP tasks, namely text summarization, question answering, paraphrasing, natural language inference, transliteration, text classification, and sentiment analysis, for zero-shot evaluation of popular LLMs: GPT-3.5, LLaMA-2-13b-chat, and Claude-2. Our experimental results demonstrate that while zero-shot LLMs can achieve performance on par with, or even better than, current SOTA fine-tuned models in some Bengali NLP tasks, in most tasks their performance is quite poor in comparison to current SOTA results, with open-source LLMs like LLaMA-2-13b-chat performing particularly poorly. These findings call for further efforts to develop a better understanding of LLMs in modest-resourced languages like Bengali.

Paper Structure (5 sections, 1 equation, 1 figure, 3 tables)

This paper contains 5 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Confusion matrices for different LLMs on the BNLI dataset for the NLI task.