BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat, Adnan Sadik, Arian Ahmed, Eunsu Kim, Alice Oh

TL;DR

Current LLMs still perform worse on Bengali cultural and linguistic contexts than on mainstream languages like English, but the results indicate Bengali's status as a mid-resource language.

Abstract

In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college-level and job-level examinations, and spans 23 categories covering knowledge of Bangladesh's culture and history and of Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered on native Bengali culture, history, and linguistics.

Paper Structure

This paper contains 19 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Illustration of an MCQA sample from the Semantics domain. To aid non-native Bengali speakers, we include the original Bengali script, romanization, and word-by-word glossing, following the Leipzig Glossing Rules, alongside the English translation.
  • Figure 2: Prompt structure for the 5-shot setting using a GPT model.
  • Figure 3: Prompting strategies used in our evaluation.
  • Figure 4: Comparison of accuracy across history and culture domains under 0-shot and 5-shot settings.
  • Figure 5: Comparison of accuracy across phonetics and semantics domains under 0-shot and 5-shot settings.
  • ...and 3 more figures