Tradutor: Building a Variety Specific Translation Model

Hugo Sousa, Satya Almasian, Ricardo Campos, Alípio Jorge

TL;DR

This paper addresses the underrepresentation of European Portuguese (EP) in translation systems by proposing a dedicated English-to-EP translation pipeline and releasing PTradutor, the largest EP–English parallel corpus to date. The authors use a retro-translation approach: EP monolingual data is translated into English to create parallel data, which is then used to fine-tune small, instruction-tuned language models. They compare full fine-tuning with LoRA-based fine-tuning across multiple benchmarks. Full fine-tuning of LLaMA-3 yields the strongest open-source performance, closely approaching industry-level systems, while LoRA offers better linguistic alignment to European Portuguese at the cost of translation quality, highlighting the trade-off between resource efficiency and accuracy. By releasing the dataset, models, and code, the work enables broader research into language-variety adaptation and practical deployment for underrepresented languages.
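
The retro-translation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_parallel_corpus`, the `translate_to_english` callable, and the `en`/`pt` field names are all hypothetical, and any real machine translation system would be plugged in where the stub is.

```python
from typing import Callable, Dict, Iterable, List


def build_parallel_corpus(
    ep_docs: Iterable[str],
    translate_to_english: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each authentic EP document with a machine-produced English version.

    `translate_to_english` is a placeholder for whatever MT system is used;
    it is an assumption of this sketch, not something named in the paper.
    """
    pairs = []
    for doc in ep_docs:
        text = doc.strip()
        if not text:
            # Skip empty or whitespace-only documents.
            continue
        english = translate_to_english(text).strip()
        if not english:
            # Skip documents for which the translator returned nothing.
            continue
        # The Portuguese side is the original human-written text;
        # only the English side is synthetic.
        pairs.append({"en": english, "pt": text})
    return pairs


if __name__ == "__main__":
    # Toy stand-in translator so the sketch runs without any model.
    toy_translator = lambda text: "EN:" + text
    corpus = build_parallel_corpus(["Olá, mundo.", "   ", "Bom dia."], toy_translator)
    for pair in corpus:
        print(pair)
```

The point of the design is that the European Portuguese side of every pair stays authentic, so a model fine-tuned on these pairs sees genuine EP text rather than machine output in that language.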

Abstract

Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Language varieties like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.

Paper Structure

This paper contains 24 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Number of documents (in millions) remaining after each step of our filtering pipeline.