BgGPT 1.0: Extending English-centric LLMs to other languages
Anton Alexandrov, Veselin Raychev, Dimitar I. Dimitrov, Ce Zhang, Martin Vechev, Kristina Toutanova
TL;DR
This work presents BgGPT-Gemma-2, a Bulgarian-focused extension of the Gemma-2 multilingual LLM family, designed to improve Bulgarian language understanding without sacrificing English performance. The authors combine continual pretraining with Branch-and-Merge to mitigate forgetting, using a data split that prioritizes Bulgarian content while preserving English skills, and train on a large, curated Bulgarian corpus (~46M high-quality documents) prepared with pipelines adapted from RedPajama. They then refine the model via supervised fine-tuning on diverse Bulgarian data sources (translated datasets, native conversations, toxicity filters, human preferences, and rhyming content), followed by model merges with Gemma-2-27b-it to inject its instruction-tuning strengths. Evaluations on Bulgarian benchmarks, both translated and native, and on a Bulgarian-focused educational and chat benchmark show BgGPT-Gemma-2-27B-Instruct achieving state-of-the-art Bulgarian performance while preserving substantial English capabilities. The weights are publicly available under a commercially friendly license. The work demonstrates that robust, language-specific LLMs can be built on top of strong English-centric bases, with practical uses in Bulgarian education, bilingual chat systems, and domain-specific knowledge tools. It also highlights data quality, transfer challenges, and ethical considerations in open-weight LLM deployment.
Abstract
We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google's Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2's multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that English-language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology and release the model weights under a commercial-friendly license, enabling broader adoption by researchers, companies, and hobbyists. Further, we establish a comprehensive set of benchmarks based on non-public educational data sources to evaluate models on Bulgarian language tasks as well as safety and chat capabilities. Our findings demonstrate the effectiveness of fine-tuning state-of-the-art models like Gemma-2 to enhance language-specific AI applications while maintaining cross-lingual capabilities.
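The Branch-and-Merge idea mentioned above can be illustrated with a minimal sketch: in each round, several branches are trained from the same starting weights on different data slices, and the branches are then merged into a single model that seeds the next round. Everything below is an assumption made for illustration and is not the authors' implementation: weights are toy dictionaries of floats, `toy_train` is a hypothetical stand-in for a real training run, and the merge rule is a plain element-wise average, which is only the simplest of several possible merge rules.

```python
from typing import Callable, Dict, List, Sequence

# Toy stand-in for a model state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def average_weights(branches: Sequence[Weights]) -> Weights:
    """Merge branches by element-wise mean (assumed, simplest-case merge rule)."""
    if not branches:
        raise ValueError("need at least one branch to merge")
    merged: Weights = {}
    for name in branches[0]:
        # Group the i-th value of this parameter across all branches.
        columns = zip(*(branch[name] for branch in branches))
        merged[name] = [sum(col) / len(branches) for col in columns]
    return merged


def branch_and_merge(
    base: Weights,
    data_slices: Sequence[List[float]],
    train_fn: Callable[[Weights, List[float]], Weights],
    branches_per_round: int = 2,
) -> Weights:
    """Train branches on separate slices, merge them, and repeat per round."""
    current = base
    for start in range(0, len(data_slices), branches_per_round):
        round_slices = data_slices[start:start + branches_per_round]
        # Every branch in a round starts from the same merged weights.
        branches = [train_fn(current, s) for s in round_slices]
        current = average_weights(branches)
    return current


def toy_train(weights: Weights, data_slice: List[float]) -> Weights:
    """Hypothetical 'training': move each parameter halfway to the slice mean."""
    target = sum(data_slice) / len(data_slice)
    return {
        name: [w + 0.5 * (target - w) for w in values]
        for name, values in weights.items()
    }


base = {"layer.weight": [0.0, 2.0]}
slices = [[1.0, 1.0], [3.0, 3.0], [2.0], [4.0]]  # two rounds of two branches
merged = branch_and_merge(base, slices, toy_train)
print(merged)
```

Because each branch sees only a small slice before being merged back, no single branch drifts far from the shared starting point, which is the intuition behind using this scheme to limit forgetting of the base model's capabilities.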
