Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Jannis Vamvas, Ignacio Pérez Prat, Angela Heldstab, Dominic P. Fischer, Sina Ahmadi, Rico Sennrich

Abstract

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

Paper Structure

This paper contains 52 sections, 3 figures, 22 tables.

Figures (3)

  • Figure 1: LLMs have asymmetric translation capabilities regarding low-resource or multi-variety languages like Romansh. In the case of Romansh, they demonstrate a general understanding of all varieties when translating out of the language, but they fail to adhere to a specific target variety when translating into the language. This asymmetry is relevant for data augmentation.
  • Figure 2: (a) The varieties of Romansh are highly diverse, as shown in this example from the WMT24++ benchmark. (b) Translations out of German by Gemini 2.5 Flash (which we use for data augmentation) are often in the wrong target variety, according to a confusion matrix that evaluates the LLM's translations against references for all varieties (Vamvas et al., 2025). (c) Our NMT system adheres to the target varieties and achieves higher BLEU.
  • Figure 3: Confusion matrices similar to Figure 2, illustrating target variety adherence in German→Romansh translation. Results are based on BLEU.
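
The confusion matrices described in the figure captions can be illustrated with a minimal sketch: translations requested for each target variety are scored against the references of every variety, so a system that adheres to the target variety scores highest on the diagonal. Everything below is an assumption made for illustration, not the authors' evaluation code. `simple_bleu` is a simplified, unsmoothed BLEU with whitespace tokenization and a single reference per segment, standing in for whatever BLEU implementation the paper uses. The variety names and token strings are placeholders, not Romansh data.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (assumption: whitespace tokens,
    one reference per segment, no smoothing). Returns a score in [0, 100]."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100 * brevity_penalty * math.exp(log_precision)


def confusion_matrix(outputs, references):
    """outputs[v]: translations requested for target variety v.
    references[v]: references in variety v (same segments, same order).
    Returns matrix[target][reference_variety] = BLEU."""
    return {
        target: {variety: simple_bleu(hyps, refs) for variety, refs in references.items()}
        for target, hyps in outputs.items()
    }


def best_match(matrix):
    """For each requested target variety, the reference variety with the highest BLEU."""
    return {target: max(row, key=row.get) for target, row in matrix.items()}


# Placeholder data: two hypothetical varieties with disjoint vocabularies.
references = {
    "variety_a": ["t1 t2 t3 t4 t5 t6"],
    "variety_b": ["u1 u2 u3 u4 u5 u6"],
}
outputs = {
    "variety_a": ["t1 t2 t3 t4 t5 t6"],  # adheres to the requested variety
    "variety_b": ["t1 t2 t3 t4 t5 t6"],  # requested variety_b, produced variety_a
}

matrix = confusion_matrix(outputs, references)
best = best_match(matrix)
for target, row in matrix.items():
    print(target, row, "best match:", best[target])
```

In this toy case the second row peaks off the diagonal, which is the kind of variety confusion the captions attribute to LLM translations into Romansh. A real evaluation would use an established BLEU implementation and the actual benchmark references.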