Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation

Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas

TL;DR

Bielik 7B v0.1 is a Polish-focused 7B-parameter language model developed to overcome data scarcity and linguistic adaptation challenges by leveraging curated Polish data, targeted post-training, and advanced fine-tuning strategies. The work introduces two core training innovations for learning from mixed-quality instruction data: Weighted Instruction Cross-Entropy Loss, $l(o_i, y_i) = - w_i \cdot \sum_{c=1}^C y_{i,c} \log p_{i,c}$, and Adaptive Learning Rate, $ALR = LR \cdot \sqrt{\frac{T}{BS}}$. It demonstrates strong Polish RAG capabilities and MT-Bench performance, including notable gains on RAG Reader tasks. Evaluations on the Open PL LLM Leaderboard and Polish MT-Bench show competitive results, with Bielik excelling in context-aware tasks and reasoning while maintaining solid overall performance. The paper also explores quantization, calibration, and efficient implementation to broaden accessibility, highlighting practical implications for deploying Polish NLP systems in resource-constrained environments, and emphasizes open-science collaboration and future improvements.
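
The two formulas quoted in the TL;DR can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function and parameter names are hypothetical, and the readings of the symbols are assumptions, since this section does not define them: $w_i$ is taken to be a per-example quality weight, $T$ the number of tokens in the current batch, and $BS$ a reference batch size in tokens.

```python
import math


def weighted_instruction_ce(probs, targets, weight):
    """Per-example loss l(o_i, y_i) = -w_i * sum_c y_{i,c} * log p_{i,c}.

    probs   -- predicted class probabilities p_{i,c} (assumed to sum to 1)
    targets -- target distribution y_{i,c}, e.g. a one-hot vector
    weight  -- example weight w_i (assumed here to be a quality score)
    """
    # Skip classes with zero target mass so log(0) is never evaluated for them.
    return -weight * sum(
        y * math.log(p) for y, p in zip(targets, probs) if y > 0
    )


def adaptive_learning_rate(base_lr, tokens_in_batch, reference_batch_size):
    """ALR = LR * sqrt(T / BS), with T and BS interpreted as stated above."""
    return base_lr * math.sqrt(tokens_in_batch / reference_batch_size)


# A lower-weight example contributes proportionally less loss.
high_quality = weighted_instruction_ce([0.25, 0.75], [0, 1], weight=1.0)
low_quality = weighted_instruction_ce([0.25, 0.75], [0, 1], weight=0.5)

# A batch with four times the reference token count doubles the learning rate.
scaled_lr = adaptive_learning_rate(1e-4, tokens_in_batch=4096, reference_batch_size=1024)
```

Under these assumptions, the weight scales each example's contribution to the gradient linearly, while the learning rate grows with the square root of the batch's token count relative to the reference size.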

Abstract

We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in language model development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.

Paper Structure

This paper contains 40 sections, 6 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Confusion matrix illustrating validation results for the XGBoost classifier model.
  • Figure 2: Training loss over the training tokens for the base model.
  • Figure 3: Training accuracy over the training tokens for the base model.
  • Figure 4: Training loss over the training iterations for the instruction model.
  • Figure 5: Training accuracy over the training iterations for the instruction model.
  • ...and 1 more figure