mStyleDistance: Multilingual Style Embeddings and their Evaluation

Justin Qiu, Jiacheng Zhu, Ajay Patel, Marianna Apidianaki, Chris Callison-Burch

TL;DR

The paper addresses the lack of multilingual style representations by proposing mStyleDistance, a multilingual style embedding model trained with contrastive learning on synthetic multilingual and cross-lingual data in nine languages. It introduces mSynthStel, a multilingual synthetic dataset of paraphrase pairs covering 40 style features, and two evaluation benchmarks, multilingual STEL-or-Content (SoC) and cross-lingual SoC, to evaluate style representation quality within and across languages. An authorship verification (AV) task probes downstream utility. Empirical results show that mStyleDistance outperforms baselines on the multilingual and cross-lingual style benchmarks, generalizes to unseen features and languages, and performs strongly on the downstream AV task. The authors publicly release the model, data, and benchmarks, and discuss both the potential and the limitations of synthetic multilingual style representations for broad linguistic coverage and practical NLP tasks.

Abstract

Style embeddings are useful for stylistic analysis and style transfer; however, only English style embeddings have been made available. We introduce Multilingual StyleDistance (mStyleDistance), a multilingual style embedding model trained using synthetic data and contrastive learning. We train the model on data from nine languages and create a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess the embeddings' quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing models on these multilingual style benchmarks and generalize well to unseen features and languages. We make our model publicly available at https://huggingface.co/StyleDistance/mstyledistance .
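
As a rough illustration of contrastive learning over (anchor, positive, negative) triplets, the sketch below computes a standard triplet margin loss over cosine distance. The source states only that the model is trained with contrastive learning on training triplets; the specific loss form, the margin value, and the toy vectors are assumptions made for illustration, not the authors' confirmed objective.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_margin_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss that is zero once the positive is closer than the negative by `margin`.

    `margin=0.1` is a hypothetical value chosen for this sketch.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-d "embeddings" (hypothetical): the positive shares the anchor's style.
anchor = [1.0, 0.0]
same_style = [1.0, 0.0]
different_style = [0.0, 1.0]

loss_good = triplet_margin_loss(anchor, same_style, different_style)
loss_bad = triplet_margin_loss(anchor, different_style, same_style)
```

Under this formulation the loss vanishes when same-style text is already embedded closer to the anchor than different-style text, and grows when the ordering is reversed.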

Paper Structure

This paper contains 24 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: mStyleDistance is trained using contrastive learning from synthetic positive and negative examples of 40 style features in 9 languages to form both multilingual and cross-lingual training triplets.
  • Figure 2: Example prompt for generating a pair of sentences in Russian.
  • Figure 3: Instances from the annotation interface.
  • Figure 4: Instances from our multilingual and cross-lingual SoC benchmarks. For multilingual SoC, the anchor is in the same language as the pos and neg sentences. For cross-lingual SoC, the anchor is in a different language from the pos and neg sentences.
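
The SoC instances in Figure 4 consist of an anchor, a pos sentence, and a neg sentence. A minimal sketch of how such triples might be scored follows, assuming an instance counts as correct when the anchor's embedding is more cosine-similar to pos than to neg. That scoring rule, the function names, and the toy vectors are assumptions for illustration; in practice the vectors would come from the embedding model under evaluation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soc_accuracy(triples):
    """Fraction of (anchor, pos, neg) embedding triples where pos is closer to the anchor."""
    correct = sum(
        1
        for anchor, pos, neg in triples
        if cosine_similarity(anchor, pos) > cosine_similarity(anchor, neg)
    )
    return correct / len(triples)

# Hypothetical embeddings standing in for model outputs.
toy_triples = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # pos is closer to the anchor
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # neg is closer to the anchor
]
accuracy = soc_accuracy(toy_triples)
```

The same function would apply to both benchmark variants, since the multilingual and cross-lingual settings differ only in whether the anchor shares a language with the pos and neg sentences.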