A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Jenny Kunz; Anja Jarochenko; Marcel Bollmann

TL;DR

This work introduces the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models.

Abstract

Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.

Paper Structure (35 sections, 1 equation, 3 figures, 3 tables)

Introduction
Release
Background
Translationese
English-to-Swedish translationese
Error tags for translations
Dataset Construction
Annotation Process
Error Tags
Dataset Analysis
Causes of Minor Errors
Semantic shift (SEM)
Lexical preference (PR)
Causes of Major Errors
Loss of meaning (BET)
...and 20 more sections

Figures (3)

Figure 1: Error tag distribution for the Swedish translations; cf. the Error Tags section for an explanation of tags.
Figure 2: Prompting setups. Each box shows a minimal pair: Sentence 1 is a translationese variant, sentence 2 is an idiomatic variant of a translation of the same English sentence. We compute the perplexity of each variant to determine which one the model prefers. Translations of text in the figure: * Translate the following sentence to Swedish. † Translate the following sentence to Swedish, considering the context.
Figure 3: Percentage of samples where models prefer the OPUS-translated sentence over the human alternative, by error tag. See the Error Tags section for explanations of error tags.
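
The minimal-pair comparison described in the Figure 2 caption can be illustrated with a small sketch. It assumes the standard per-token definition of perplexity (the exponential of the negative mean token log-probability) and that the variant with the lower perplexity counts as the model's preference; the paper's own equation is not reproduced on this page, so the exact formulation may differ. The function names and the log-probability values are hypothetical and only serve as an illustration; in practice the log-probabilities would come from a language model scoring each Swedish variant.

```python
import math


def perplexity(token_logprobs):
    """Per-token perplexity from natural-log token probabilities.

    Assumption: standard definition exp(-mean(log p)); the paper's own
    formula is not shown in this excerpt.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prefers_translationese(translationese_logprobs, idiomatic_logprobs):
    """True if the translationese variant has the lower perplexity."""
    return perplexity(translationese_logprobs) < perplexity(idiomatic_logprobs)


def preference_rate(pairs):
    """Share of minimal pairs where the translationese variant is preferred.

    `pairs` is a list of (translationese_logprobs, idiomatic_logprobs) tuples.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    preferred = sum(prefers_translationese(t, i) for t, i in pairs)
    return preferred / len(pairs)


# Hypothetical log-probabilities for two minimal pairs (not real model output).
example_pairs = [
    ([-1.0, -1.2, -0.8], [-1.5, -1.4, -1.6]),  # translationese scores better
    ([-2.0, -2.2], [-1.0, -1.1, -0.9]),        # idiomatic scores better
]
rate = preference_rate(example_pairs)
```

Normalizing by token count keeps the comparison from trivially favoring the shorter variant, which matters because the two sentences in a pair usually differ in length.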
