Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking

Emre Can Acikgoz, Mete Erdogan, Deniz Yuret

TL;DR

This work tackles the challenge of building high-quality Turkish LLMs under resource constraints by evaluating two strategies: adapting English-pretrained base models through continued pretraining on Turkish data (the Hamza$_{\scriptsize Mistral}$ and Hamza$_{\scriptsize GPT2-xl}$ variants), and training Turkish base models from scratch on Turkish corpora (the Hamza series). It introduces a Turkish instruction-tuning dataset (Self-Instruct) and a Turkish LLM leaderboard with the ARC-TR and TruthfulQA-TR benchmarks to assess reasoning and factual accuracy. Key contributions include the Hamza family of models (124M–1.3B parameters), the release of 300B-token-scale Turkish pretraining data drawn from CulturaX, mC4, and OSCAR, and the creation and validation of the TruthfulQA-TR and ARC-TR datasets with annotation metrics. Findings show that adapting strong base models can yield competitive Turkish performance but risks catastrophic forgetting of the original language, while from-scratch training benefits from large, high-quality Turkish data. Together, these insights offer a practical roadmap for advancing Turkish NLP and, more broadly, LLMs for low-resource languages. The work provides open-source code, datasets, and a Turkish benchmark framework to catalyze future research and broader linguistic inclusion in NLP.

Abstract

Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conducted experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.

Paper Structure

This paper contains 72 sections, 7 equations, 1 figure, and 12 tables.

Figures (1)

  • Figure 1: Accuracy of continued-pretrained models on English (left, right) and Turkish (right) question answering tasks, demonstrating catastrophic forgetting of the original language while the new language is learned. The table on the left reports the performance of our Hamza$_{\scriptsize Mistral}$ and Hamza$_{\scriptsize GPT2-xl}$ models, which are adapted to Turkish, alongside the original Mistral 7B and GPT2-xl. It presents the results of our ablation study, in which the adapted models are evaluated as the pretraining corpus is progressively enlarged from 0.1 GB to 5 GB. Here, zero-shot and few-shot accuracies are evaluated on the original ARC and TruthfulQA. The figure on the right shows the Mistral model's results on both the Turkish and English versions of the ARC dataset, highlighting improving performance in Turkish and declining performance in English with continued pretraining.