
Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation

M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Banu Diri, Savaş Yıldırım, Öner Aytaş

TL;DR

The paper introduces TR-MMLU, a native Turkish benchmark for evaluating the linguistic and conceptual understanding of large language models, built from 6,200 curriculum-derived questions across 62 categories. It emphasizes knowledge assessment over instruction-following, keeps the dataset transparent, and avoids overlap with pretraining data, giving a culturally relevant and reproducible framework. The authors evaluate 39 LLMs using prompting and semantic matching. Key findings: Turkish-specific tokenization and domain-focused fine-tuning improve performance, while catastrophic forgetting and data scarcity limit gains. These results point to the need for tailored tokenization, robust training strategies, and larger Turkish datasets. TR-MMLU is intended as a standard for Turkish NLP evaluation and as a guide for future research and tooling for Turkish as a resource-limited language.

Abstract

Language models have made remarkable advancements in understanding and generating human language, achieving notable success across a wide array of applications. However, evaluating these models remains a significant challenge, particularly for resource-limited languages such as Turkish. To address this gap, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is constructed from a carefully curated dataset comprising 6,200 multiple-choice questions across 62 sections, selected from a pool of 280,000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. This benchmark provides a transparent, reproducible, and culturally relevant tool for evaluating model performance. It serves as a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text and fostering the development of more robust and accurate language models. In this study, we evaluate state-of-the-art LLMs on TR-MMLU, providing insights into their strengths and limitations for Turkish-specific tasks. Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design. By setting a new standard for evaluating Turkish language models, TR-MMLU aims to inspire future innovations and support the advancement of Turkish NLP research.
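
To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a sectioned multiple-choice benchmark could be tallied. It is not the authors' evaluation code: the function name `score_by_section`, the record layout, and the toy section names are assumptions made for illustration, and a plain case-insensitive comparison of choice letters stands in for the semantic matching the paper describes.

```python
from collections import defaultdict


def score_by_section(records):
    """Tally accuracy per section and overall.

    records: iterable of (section, predicted_choice, correct_choice) tuples.
    This layout is hypothetical, not the TR-MMLU data format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for section, predicted, correct in records:
        totals[section] += 1
        # Stand-in for semantic matching: compare normalized choice letters.
        if predicted.strip().upper() == correct.strip().upper():
            hits[section] += 1

    if not totals:
        # No records: return empty results rather than dividing by zero.
        return {}, 0.0

    per_section = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_section, overall


# Toy records, invented for illustration only.
records = [
    ("Tarih", "A", "A"),
    ("Tarih", "b", "C"),
    ("Biyoloji", " d ", "D"),
    ("Biyoloji", "B", "B"),
]
per_section, overall = score_by_section(records)
print(per_section, overall)
```

Reporting per-section scores next to the overall figure keeps weak subject areas visible, which is the kind of detailed analysis the abstract describes as a goal of the benchmark.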

Paper Structure (5 sections, 2 tables)