FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes

Dawid Wiśniewski, Zofia Rostek, Artur Nowakowski

TL;DR

FAME-MT addresses the problem of controlling target-language formality in machine translation by introducing a large-scale multilingual dataset with formal and informal annotations across 112 European language pairs. The authors describe a three-step pipeline of data collection, labeling, and compilation that uses formality classifiers trained on English and additional languages to produce 100,000 examples per language pair. They then demonstrate formality-controlled MT by fine-tuning a model with special formality tokens. Dataset quality is checked through exploratory analyses of length, tokens, and readability, and the MT experiments show translation quality that improves or remains stable in the targeted directions. Both the data and the tooling are released openly. The work offers a scalable path to formality-aware MT for underrepresented languages, with potential impact on user experience and translation adequacy in formal contexts.
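Fine-tuning with formality tokens is commonly done by prepending a control tag to each source sentence. The following is a minimal sketch of that idea, not the authors' implementation: the token strings `<formal>` and `<informal>`, the helper names `tag_source` and `build_training_pairs`, and the example record layout (`source`, `target`, `label`) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical control tokens; the names actually used by the authors may differ.
FORMALITY_TOKENS: Dict[str, str] = {"formal": "<formal>", "informal": "<informal>"}


def tag_source(source: str, formality: str) -> str:
    """Prepend the control token for the given formality label to a source sentence."""
    if formality not in FORMALITY_TOKENS:
        raise ValueError(f"unknown formality label: {formality!r}")
    return f"{FORMALITY_TOKENS[formality]} {source.strip()}"


def build_training_pairs(examples: List[dict]) -> List[Tuple[str, str]]:
    """Turn labeled examples into (tagged source, target) pairs for fine-tuning.

    Each example is assumed to be a dict with 'source', 'target', and 'label' keys,
    where 'label' is the formality class assigned to the target sentence.
    """
    return [(tag_source(ex["source"], ex["label"]), ex["target"]) for ex in examples]
```

At inference time, the same tag would be prepended to the input to request the desired register, assuming the model has seen both tags during fine-tuning.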

Abstract

People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT, a dataset consisting of 11.2 million translations between 15 European source languages and 8 European target languages, classified into formal and informal classes according to the formality of the target sentence. This dataset can be used to fine-tune machine translation models to ensure a given formality level for each European target language considered. We describe the dataset creation procedure and an analysis of the dataset's quality showing that FAME-MT is a reliable source of language register information, and we present a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of the translation. Currently, it is the largest dataset of formality annotations, with examples expressed in 112 European language pairs. The dataset is published online: https://github.com/laniqo-public/fame-mt/
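The figures in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that the 8 target languages are also among the 15 source languages and that same-language pairs are excluded; under that assumption, 15 × 8 − 8 = 112 pairs, and 100,000 examples per pair gives 11.2 million. The labels `S1` to `S15` are placeholders, not the language codes used in the dataset.

```python
from itertools import product

# Placeholder labels, not the real language codes from the paper.
sources = [f"S{i}" for i in range(1, 16)]  # 15 source languages
targets = sources[:8]                      # assumption: the 8 targets are also sources

# All source-target combinations, excluding pairs where both sides are the same language.
pairs = [(s, t) for s, t in product(sources, targets) if s != t]

examples_per_pair = 100_000
total_examples = len(pairs) * examples_per_pair

print(len(pairs), total_examples)
```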

Paper Structure (24 sections, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Violin plots of the distributions of sentence length, measured in characters. The upper plot shows the distributions over the original dataset. Because it contains some outliers with very large values, the lower plot is generated over the subset of texts whose lengths lie between Q1 - 1.5 IQR and Q3 + 1.5 IQR (Q1 = first quartile, Q3 = third quartile, IQR = interquartile range), to focus on the most common cases.
  • Figure 2: Plots of the distributions of the number of punctuation marks per sentence for each language. The upper plot shows the distributions over the original dataset. Because it contains some outliers, the lower plot is generated over the subset of texts whose lengths lie between Q1 - 1.5 IQR and Q3 + 1.5 IQR (Q1 = first quartile, Q3 = third quartile, IQR = interquartile range), to focus on the most common cases.
  • Figure 3: Plots of the distributions of the mean word length per sentence for each language. The upper plot shows the distributions over the original dataset. Because it contains some outliers, the lower plot is generated over the subset of texts whose lengths lie between Q1 - 1.5 IQR and Q3 + 1.5 IQR (Q1 = first quartile, Q3 = third quartile, IQR = interquartile range), to focus on the most common cases.
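
The outlier filter described in the captions above (keeping values between Q1 - 1.5 IQR and Q3 + 1.5 IQR) can be sketched as follows. This is an illustrative sketch only: the function name `iqr_filter` is hypothetical, and the quartile method (`"inclusive"` in Python's `statistics.quantiles`) is an assumption, since the captions do not say how the quartiles were computed.

```python
import statistics
from typing import List


def iqr_filter(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only values within [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    # The 'inclusive' method is an assumption; other methods give slightly different bounds.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Toy character counts with one extreme outlier.
lengths = [10, 12, 11, 13, 12, 11, 500]
print(iqr_filter(lengths))
```

In this toy input the value 500 falls outside the upper bound and is dropped, while the remaining values are kept in their original order.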