VBART: The Turkish LLM

Meliksah Turker, Mehmet Erdi Ari, Aydin Han

TL;DR

VBART presents the first Turkish sequence-to-sequence LLMs trained from scratch, delivering two sizes (Large and XLarge) and a monolingual SentencePiece Unigram tokenizer. Built on an mBART-inspired encoder–decoder with BART-like pretraining, VBART achieves state-of-the-art results across summarization, title generation, paraphrasing, question generation, and question answering tasks, while offering a data-efficient, scalable path for Turkish NLP. The work includes a 135.7 GB cleaned Turkish corpus, dynamic data generation, and a model enlargement technique used to produce XLarge, and it critically examines whether Chinchilla scaling applies to encoder–decoder models. The authors also release the tokenizer and models on Hugging Face, illustrating the practical benefits of language-specific pretraining for low-resource languages and opening avenues for further enlargement and architecture variation. Overall, VBART demonstrates that dedicated Turkish LLMs can outperform multilingual counterparts with substantially fewer parameters and greater efficiency, accelerating Turkish NLP research and deployment.

Abstract

We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained from scratch on a large corpus. VBART models are compact LLMs that build on ideas from the BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They can be fine-tuned for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM for Turkish outperforms multilingual models up to 3x its size, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned vngrs-web-corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.

Paper Structure (31 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Pre-training loss of VBART models. Note that there are two steep drops in loss towards the end of the training. This is due to the reduction in Dropout from 0.10 to 0.05 and then to 0.
  • Figure 2: Relative number of tokens spent to encode Turkish text on average. The numbers below show the vocabulary size of each tokenizer. Despite their large vocabulary sizes, multilingual tokenizers spend up to 96% more tokens than the VBART Tokenizer.
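
The relative token count in Figure 2 can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `relative_token_cost`, the two toy tokenizers, and the sample sentences are all hypothetical stand-ins chosen so the example runs without downloading any model. In practice each callable would wrap a real tokenizer and the texts would be a held-out Turkish sample.

```python
from typing import Callable, Dict, List

# A tokenizer is modelled as any callable mapping a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def relative_token_cost(
    texts: List[str],
    tokenizers: Dict[str, Tokenizer],
    baseline: str,
) -> Dict[str, float]:
    """Return each tokenizer's total token count divided by the baseline's.

    A value of 1.5 means that tokenizer spends 50% more tokens than the
    baseline tokenizer on the same texts.
    """
    totals = {
        name: sum(len(tokenize(text)) for text in texts)
        for name, tokenize in tokenizers.items()
    }
    base_total = totals[baseline]
    if base_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return {name: total / base_total for name, total in totals.items()}


# Toy stand-ins (assumptions for illustration, not real subword tokenizers).
def word_tokenizer(text: str) -> List[str]:
    # Splits on whitespace: few, long tokens.
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    # One token per non-space character: many, short tokens.
    return [ch for ch in text if not ch.isspace()]


sample_texts = ["Merhaba dünya", "Türkçe doğal dil işleme"]
ratios = relative_token_cost(
    sample_texts,
    {"word": word_tokenizer, "char": char_tokenizer},
    baseline="word",
)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x tokens relative to baseline")
```

Normalizing by a baseline rather than reporting raw counts keeps the comparison independent of corpus size, which appears to be the idea behind the "relative number of tokens" axis in the figure.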