Investigating the translation capabilities of Large Language Models trained on parallel data only
Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero
TL;DR
This work investigates the translation capabilities of decoder-only LLMs trained exclusively on parallel data. It introduces PLUME, a trio of 2B-parameter models with vocabulary sizes of 32k, 128k, and 256k, trained on Catalan-centric parallel data. PLUME performs competitively with encoder-decoder MT systems across 16 supervised and 56 zero-shot directions, which enables a focused study of how prompt design and cross-lingual representations influence translation. The authors analyze attention patterns, source-tag usage, and cross-lingual subspaces, finding that larger vocabularies improve zero-shot translation and that certain attention heads can be pruned with minimal impact. The study offers insights into vocabulary design, interpretability, and potential pruning strategies; it also outlines limitations related to data scope and scalability and suggests directions for future research on sink-head dynamics and broader language coverage.
Abstract
In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.
