RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware

Gonzalo Santamaría Gómez, Guillem García Subies, Pablo Gutiérrez Ruiz, Mario González Valero, Natàlia Fuertes, Helena Montoro Zamorano, Carmen Muñoz Sanz, Leire Rosado Plaza, Nuria Aldama García, David Betancur Sánchez, Kateryna Sushkova, Marta Guerrero Nieto, Álvaro Barbero Jiménez

TL;DR

The paper demonstrates that robust Spanish-language alignment of a 7B-scale open LLM is achievable with bounded hardware by emphasizing high-quality data collection, automated evaluation, and Direct Preference Optimization (DPO) fine-tuning. It introduces RigoChat 2, built on Qwen-2.5-7B-Instruct, and leverages LoRA-based PEFT, HQ+ data augmentation, and a large private Preference Dataset to improve Spanish performance while preserving general capabilities. Comprehensive evaluations across Spanish and multilingual benchmarks show competitive, and in some cases superior, performance relative to larger models, with quantized variants enabling efficient CPU inference. The work highlights data quality as a critical driver of performance, enabling accessible, privacy-preserving, and resource-efficient deployment, and outlines concrete future directions for evaluation, data curation, and transfer to broader multilingual NLU tasks.

Abstract

Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need to collect problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing these models to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model for a given language task without compromising its overall capabilities, using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Task distribution estimation of the compilation of data resources.
  • Figure 2: System prompt used for the LLM evaluation in Abstractive Question-Answering.
  • Figure 3: Density comparisons across different metrics. Legend and color map are sorted based on distance to human evaluations.
  • Figure 4: DPO loss (Rafailov et al., 2024) of the training process.
  • Figure 5: Bar Plot of the evaluation results. The figure is scaled from 50 to 90 in order to better appreciate the differences between all tested models.
  • ...and 2 more figures