mStyleDistance: Multilingual Style Embeddings and their Evaluation
Justin Qiu, Jiacheng Zhu, Ajay Patel, Marianna Apidianaki, Chris Callison-Burch
TL;DR
The paper addresses the lack of multilingual style representations by proposing mStyleDistance, a multilingual style embedding model trained with contrastive learning on synthetic, cross-lingual data in nine languages. It introduces mSynthStel, a multilingual synthetic dataset of paraphrase pairs covering 40 style features, and two evaluation benchmarks, multilingual STEL-or-Content (SoC) and cross-lingual SoC, which quantify how well embeddings capture style within and across languages. An authorship verification (AV) task probes downstream utility. Empirical results show that mStyleDistance outperforms baselines on the multilingual and cross-lingual SoC benchmarks, generalizes to unseen style features and languages, and performs strongly on the downstream AV task. The model, data, and benchmarks are publicly released, and the work highlights both the potential and the limitations of synthetic multilingual style representations for broad linguistic coverage and practical NLP tasks.
Abstract
Style embeddings are useful for stylistic analysis and style transfer; however, only English style embeddings have been made available. We introduce Multilingual StyleDistance (mStyleDistance), a multilingual style embedding model trained using synthetic data and contrastive learning. We train the model on data from nine languages and create a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess the embeddings' quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing models on these multilingual style benchmarks and generalize well to unseen features and languages. We make our model publicly available at https://huggingface.co/StyleDistance/mstyledistance .
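To make the STEL-or-Content idea concrete, the following is a minimal sketch of how such a benchmark can be scored, assuming the common triplet formulation: an anchor sentence, a sentence sharing the anchor's style but not its content, and a paraphrase sharing the anchor's content but not its style. An embedding is counted correct on a triplet when the anchor is closer to the same-style sentence than to the same-content one. The function names, the toy triplets, and both toy embedders are hypothetical stand-ins for illustration only; they are not the paper's benchmark data, protocol, or model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[str, str, str]  # (anchor, same_style, same_content)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soc_accuracy(embed: Callable[[str], Vector], triplets: List[Triplet]) -> float:
    """Fraction of triplets where the anchor is closer to the same-style
    sentence than to the same-content paraphrase (assumed scoring rule)."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = 0
    for anchor, same_style, same_content in triplets:
        a = embed(anchor)
        if cosine(a, embed(same_style)) > cosine(a, embed(same_content)):
            correct += 1
    return correct / len(triplets)


def toy_style_embed(text: str) -> List[float]:
    """Hypothetical style embedder built from surface cues only:
    uppercase ratio, lowercase ratio, and exclamation-mark density."""
    letters = [ch for ch in text if ch.isalpha()]
    n_letters = max(len(letters), 1)
    n_chars = max(len(text), 1)
    return [
        sum(ch.isupper() for ch in letters) / n_letters,
        sum(ch.islower() for ch in letters) / n_letters,
        text.count("!") / n_chars,
    ]


def toy_content_embed(text: str) -> List[float]:
    """Hypothetical content embedder: case-insensitive counts of a-z,
    so it ignores casing and punctuation entirely."""
    lowered = text.lower()
    return [float(lowered.count(chr(ord("a") + i))) for i in range(26)]


# Illustrative triplets, not taken from any benchmark.
TRIPLETS: List[Triplet] = [
    ("I LOVE THIS MOVIE!!!", "THE TRAIN IS LATE AGAIN!!!", "i love this movie."),
    ("hello there.", "the bus is late.", "HELLO THERE!!!"),
]

style_score = soc_accuracy(toy_style_embed, TRIPLETS)
content_score = soc_accuracy(toy_content_embed, TRIPLETS)
print(f"style-sensitive toy embedder:   {style_score:.2f}")
print(f"content-sensitive toy embedder: {content_score:.2f}")
```

In this toy setup the style-sensitive embedder matches the anchor to the same-style sentence, while the content-sensitive one is drawn to the paraphrase. To evaluate a real model, `embed` would be replaced by a function that returns that model's sentence embedding; a cross-lingual variant would presumably draw the sentences of a triplet from different languages, though the exact construction is defined in the paper rather than here.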
