OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
Mihai Masala, Denis C. Ilie-Ablachim, Dragos Corlatescu, Miruna Zavelca, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
TL;DR
This work presents OpenLLM-Ro, the first open-source Romanian LLM effort built on Llama 2, introducing RoLlama foundational and chat variants trained with CulturaX-derived data and a translated instruction/conversation corpus. By leveraging continual pretraining and supervised finetuning on Romanian tasks, the authors demonstrate improvements over existing Romanian LLMs and provide a replicable recipe for low-resource languages. They evaluate across multiple benchmarks, address language-generation challenges with Romanian prompts, and highlight the importance of conversation data in finetuning. The work lays a foundation for Romanian NLP research and industry applications, and outlines clear paths for improving data quality, alignment, and scaling to larger models.
Abstract
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLMs specialized for Romanian.
