LLMic: Romanian Foundation Language Model

Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu

TL;DR

LLMic addresses the scarcity of high-quality Romanian NLP models by building a bilingual Romanian-English foundation model at 3B parameters. It delivers a full pipeline from data collection (Romanian-English corpora, curated sources, and parallel data) and a purpose-built GPT-NeoX–style tokenizer to a Llama2-inspired decoder architecture trained with a carefully staged language mix and a cosine LR schedule. The model demonstrates competitive English-to-Romanian translation after fine-tuning and remains viable for edge deployment through quantization, outperforming several open models and approaching closed-model performance. By releasing LLMic under Apache 2.0, the work aims to accelerate Romanian NLP tooling and community-driven model development.

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with commercial models leading the way. While open models usually operate at a smaller scale, they maintain competitiveness through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to those of much larger open models. We show that fine-tuning LLMic for translation after the initial pretraining phase outperforms existing solutions on English-to-Romanian translation tasks. This opens the path to efficient large-scale processing for the Romanian language community using the much smaller LLMic model.

Paper Structure

This paper contains 9 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Tokenizer fertility analysis across different languages and text sources. The graph shows the relationship between input text and resulting tokens.
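Tokenizer fertility, as referenced in Figure 1, is commonly measured as the average number of tokens produced per word of input text; lower values mean the tokenizer represents a language more compactly. The following is a minimal sketch under that assumed definition. The `toy_subword_tokenize` function and the sample sentences are hypothetical stand-ins for illustration only; they are not the LLMic tokenizer or its evaluation data.

```python
from typing import Callable, Dict, List


def toy_subword_tokenize(text: str, max_piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into fixed-size pieces.

    A real subword tokenizer (e.g. BPE) learns its pieces from data; this
    one only exists so the fertility computation has something to measure.
    """
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(
            word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)
        )
    return pieces


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


# Illustrative sentences only, not taken from the paper's corpora.
samples: Dict[str, str] = {
    "en": "the model translates text",
    "ro": "modelul traduce propoziții românești",
}
scores = {lang: fertility(text, toy_subword_tokenize) for lang, text in samples.items()}
```

Passing a real tokenizer's encode function as `tokenize` and running the same text sources through several tokenizers would give a comparison of the kind the figure describes.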