VBART: The Turkish LLM
Meliksah Turker, Mehmet Erdi Ari, Aydin Han
TL;DR
VBART introduces the first Turkish sequence-to-sequence LLMs trained from scratch. The models come in two sizes (Large and XLarge) and use a monolingual SentencePiece Unigram tokenizer. Built on an mBART-inspired encoder-decoder architecture with BART-like pretraining, VBART achieves state-of-the-art results on summarization, title generation, paraphrasing, question generation, and question answering. The work includes a 135.7 GB cleaned Turkish corpus, dynamic data generation, and a model enlargement technique used to produce XLarge from an existing pre-trained model, and it critically examines whether Chinchilla scaling applies to encoder-decoder models. The authors release the tokenizer and models on HuggingFace. Overall, VBART shows that dedicated Turkish LLMs can outperform multilingual counterparts with substantially fewer parameters, which makes training and inference more efficient and opens avenues for further enlargement and architecture variation.
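To make "BART-like pretraining" concrete, the following is a minimal sketch of the text-infilling noise described in the original BART paper, where token spans with Poisson-distributed lengths are each replaced by a single mask token. The function names, the `<mask>` string, and the defaults (30% of tokens masked, λ = 3.5, taken from BART) are assumptions for illustration; this summary does not state VBART's exact settings, and the sketch skips zero-length spans for simplicity.

```python
import math
import random
from typing import List, Optional


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1


def text_infill(
    tokens: List[str],
    mask_ratio: float = 0.3,   # assumed default, taken from the BART paper
    lam: float = 3.5,          # assumed default, taken from the BART paper
    mask_token: str = "<mask>",  # hypothetical mask symbol
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace random spans of tokens with a single mask token each."""
    rng = rng or random.Random()
    budget = int(round(len(tokens) * mask_ratio))  # tokens we may remove
    out = list(tokens)
    masked = 0
    attempts = 0
    max_attempts = 10 * len(tokens) + 10  # guard against endless retries
    while masked < budget and attempts < max_attempts:
        attempts += 1
        span = min(sample_poisson(lam, rng), budget - masked)
        if span == 0 or span > len(out):
            continue  # zero-length spans are skipped in this sketch
        start = rng.randrange(0, len(out) - span + 1)
        if mask_token in out[start:start + span]:
            continue  # do not mask over an existing mask
        out[start:start + span] = [mask_token]  # whole span -> one mask
        masked += span
    return out


if __name__ == "__main__":
    sentence = "bu cümle gürültü eklenerek bozulacak ve model onu geri üretecek".split()
    print(text_infill(sentence, rng=random.Random(0)))
```

During pretraining, the corrupted sequence would be fed to the encoder and the decoder would be trained to reconstruct the original sequence. Because the noise is sampled at call time, applying such a function on the fly yields a different corruption each time an example is seen.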
Abstract
We present VBART, the first Turkish sequence-to-sequence Large Language Models (LLMs) pre-trained on a large corpus from scratch. VBART models are compact LLMs that build on ideas from the BART and mBART models and come in two sizes, Large and XLarge. Fine-tuned VBART models surpass the prior state-of-the-art results in abstractive text summarization, title generation, text paraphrasing, question answering and question generation tasks. They allow fine-tuning for future text generation tasks and datasets, carving a new path for Turkish Natural Language Processing (NLP) research. Our work shows that a pre-trained LLM for Turkish outperforms multilingual models up to 3x larger, improving existing results and providing efficient models for training and inference. Moreover, we show that our monolingual tokenizer is up to 11x more efficient than multilingual tokenizers. Last but not least, we introduce a method to enlarge an existing pre-trained LLM and question the relevance of the Chinchilla Scaling Law to sequence-to-sequence masked language models. Our fine-tuned models, tokenizer and cleaned vngrs-web-corpus of 135 GB are publicly available at huggingface.co/vngrs-ai.
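One way to read a tokenizer efficiency claim such as "up to 11x" is as a ratio of token counts on the same text. The sketch below assumes that definition, which this abstract does not spell out. The two tokenizers are toy stand-ins (UTF-8 bytes and whitespace-separated words), not the VBART tokenizer or any real multilingual tokenizer, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], Sequence]


def token_count(tokenize: Tokenizer, texts: List[str]) -> int:
    """Total number of tokens a tokenizer produces over a list of texts."""
    return sum(len(tokenize(text)) for text in texts)


def efficiency_ratio(baseline: Tokenizer, candidate: Tokenizer,
                     texts: List[str]) -> float:
    """How many baseline tokens are needed per candidate token.

    A value above 1.0 means the candidate encodes the same text
    with fewer tokens than the baseline.
    """
    return token_count(baseline, texts) / token_count(candidate, texts)


# Toy stand-ins, used only to make the sketch runnable.
def byte_tokenizer(text: str) -> Sequence:
    return list(text.encode("utf-8"))  # one token per UTF-8 byte


def word_tokenizer(text: str) -> Sequence:
    return text.split()  # one token per whitespace-separated word


if __name__ == "__main__":
    sample = ["Türkçe için sıfırdan eğitilmiş bir dil modeli."]
    print(efficiency_ratio(byte_tokenizer, word_tokenizer, sample))
```

In practice, the two callables would wrap real tokenizers and the ratio would be computed over a held-out Turkish corpus. Fewer tokens per text means shorter input sequences, which is where the training and inference savings would come from.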
