Efficient Language Adaptive Pre-training: Extending State-of-the-Art Large Language Models for Polish
Szymon Ruciński
TL;DR
This work addresses the challenge of giving large language models high-quality Polish capability by applying Language Adaptive Pre-training (LAPT) to an existing English-centric base model using a Polish corpus of modest size. Employing Low-Rank Adaptation (LoRA) to update only a small subset of parameters, the author produces Curie-7B-v1, which achieves a perplexity of 3.02 and comes within 2% of the best Polish models on 8 of 9 KLEJ tasks while using only 2–3% of the data typically used to train Polish models. The approach is data- and energy-efficient: training took about 106 GPU-hours at low monetary cost, and the resulting model is released as open source. The work shows that language-specific adaptation via parameter-efficient fine-tuning is a viable way to extend existing LLMs to new languages, and sets a practical precedent for rapid, cost-effective multilingual deployment.
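To make the perplexity figure concrete, the following is a minimal sketch of how perplexity is conventionally derived from per-token log-probabilities, assuming natural logarithms and token-level averaging. The function name `perplexity` and the toy inputs are illustrative assumptions, not the evaluation code behind the reported 3.02.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood per token.

    `token_log_probs` holds natural-log probabilities that a model assigned
    to each observed token (hypothetical input format for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that gives every token probability 1/4
# has a perplexity of about 4 (it is "choosing among 4 options").
uniform_quarter = [math.log(0.25)] * 8
print(perplexity(uniform_quarter))
```

Under this definition, a perplexity of 3.02 corresponds to a mean negative log-likelihood of ln(3.02), roughly 1.105 nats per token.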
Abstract
This study explores the potential of fine-tuning foundational English Large Language Models (LLMs) to generate Polish text. The first step is Language Adaptive Pre-training (LAPT) on a high-quality 3.11 GB dataset consisting of 276 million Polish tokens. The LAPT is followed by additional fine-tuning aimed at solving nine KLEJ challenges. Our trained model, Curie-7B-v1, not only generates Polish text with the lowest perplexity (3.02) among decoder-based Polish models but also closely rivals the best Polish encoder-decoder models, with a gap of less than 2% on 8 out of 9 tasks. Curie-7B-v1 used approximately 2-3% of a typical dataset size to learn Polish. The LAPT was completed in less than five days on a consumer GPU, highlighting the method's efficiency. The model's proficiency in Polish was significantly enhanced, demonstrating that new languages can be added to existing LLMs by training just 1.2% of their parameters. To contribute to the community's collaborative progress, the model has been released as open source.
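As a rough illustration of why low-rank adaptation trains only a small share of a model's parameters, the sketch below counts the weights that LoRA adds to a set of linear layers. For a layer of shape d_out × d_in and rank r, LoRA adds two matrices with r · (d_in + d_out) entries in total. The layer shapes and rank used here are hypothetical and are not the Curie-7B-v1 configuration; whether the reported 1.2% is measured against the base weights alone or against base plus adapter weights is also an assumption of this sketch.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by LoRA to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def trainable_fraction(layer_shapes, rank):
    """Share of trainable (adapter) parameters among base plus adapter weights.

    `layer_shapes` is a list of (d_in, d_out) pairs for the adapted layers only;
    layers without adapters are ignored in this simplified sketch.
    """
    base = sum(d_in * d_out for d_in, d_out in layer_shapes)
    added = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return added / (base + added)


# Hypothetical example: four 4096 x 4096 projections adapted with rank 32.
shapes = [(4096, 4096)] * 4
print(f"{trainable_fraction(shapes, rank=32):.4f}")
```

With these illustrative values the trainable share is 1/65, about 1.5%, which is the same order of magnitude as the 1.2% figure reported in the abstract.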
