Investigating the translation capabilities of Large Language Models trained on parallel data only

Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero

TL;DR

This work investigates translation capabilities of decoder-only LLMs trained exclusively on parallel data by introducing PLUME, a trio of 2B-parameter models with vocabularies of 32k, 128k, and 256k trained on Catalan-centric parallel data. PLUME achieves competitive performance with encoder-decoder MT systems across 16 supervised and 56 zero-shot directions, enabling a focused study of how prompt design and cross-lingual representations influence translation. The authors analyze attention patterns, source-tag usage, and cross-lingual subspaces, finding that larger vocabularies improve zero-shot translation and that certain attention heads can be pruned with minimal impact. The study provides actionable insights into vocabulary design, interpretability, and potential pruning strategies, while outlining limitations related to data scope and scalability and suggesting directions for future research on sink-head dynamics and broader language coverage.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

Paper Structure

This paper contains 35 sections, 5 equations, 18 figures, 24 tables.

Figures (18)

  • Figure 1: Prompt strategy used to train Plume.
  • Figure 2: Coverage evaluated on the Flores-200 devtest set using Plume 32k. For each part of the prompt, a heatmap shows the coverage score for each layer (vertical axis) and each head (horizontal axis) in the model.
  • Figure 3: Illustration of the regions in the attention matrix used to compute coverage for each part of the prompt. We show the cross-attention regions between decoded tokens and the BOS, source tag, source sentence and target tag tokens in green, yellow, blue, and red, respectively.
  • Figure 4: Impact of masking on BLEU score and number of masked heads across different coverage thresholds (left). Accumulated coverage of masked heads for source tag, target tag, source sentence, and BOS (right). Experiments are evaluated on the Spanish to Catalan direction.
  • Figure 5: Mean distance between language subspaces grouped by vocabulary size. Additional plots grouped by languages and vocabulary sizes are included in the appendix on subspace distances.
  • ...and 13 more figures