The PLLuM Instruction Corpus
Piotr Pęzik, Filip Żarnecki, Konrad Kaczyński, Anna Cichosz, Zuzanna Deckert, Monika Garnys, Izabela Grabarczyk, Wojciech Janowski, Sylwia Karasińska, Aleksandra Kujawiak, Piotr Misztela, Maria Szymańska, Karolina Walkusz, Igor Siek, Maciej Chrabąszcz, Anna Kołos, Agnieszka Karlińska, Karolina Seweryn, Aleksandra Krasnodębska, Paula Betscher, Zofia Cieślińska, Katarzyna Kowol, Artur Wilczek, Maciej Trzciński, Katarzyna Dziewulska, Roman Roszko, Tomasz Bernaś, Jurgita Vaičenonienė, Danuta Roszko, Paweł Levchuk, Paweł Kowalski, Irena Prawdzic-Jankowska, Marek Kozłowski, Sławomir Dadas, Rafał Poświata, Alina Wróblewska, Katarzyna Krasnowska-Kieraś, Maciej Ogrodniczuk, Michał Rudolf, Piotr Rybak, Karolina Saputa, Joanna Wołoszyn, Marcin Oleksy, Bartłomiej Koptyra, Teddy Ferdinan, Stanisław Woźniak, Maciej Piasecki, Paweł Walkowiak, Konrad Wojtasik, Arkadiusz Janz, Przemysław Kazienko, Julia Moska, Jan Kocoń
TL;DR
This work presents the PLLuM Instruction Corpus (PLLuMIC) and its typology (Organic, Converted, Synthetic) as a foundation for fine-tuning Polish LLMs. It details multiple data-generation pipelines (manual annotation, multi-turn dialogues, knowledge distillation, RAG, and context-injected NLP) and discusses transparency, quality control, and limitations of synthetic data. Language-adaptation experiments show that substantial continual Polish pretraining combined with PLLuMIC fine-tuning improves linguistic-cultural competency, while alignment on English-dominated data can introduce negative transfer and longer, less idiomatic outputs. The authors release a representative PLLuMIC subset to guide similar dataset development and highlight the importance of high-quality, language-specific organic data for robust linguistically adapted LLMs.
Abstract
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
