Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models
Benjamin Icard, Evangelia Zve, Lila Sainero, Alice Breton, Jean-Gabriel Ganascia
TL;DR
The paper investigates how writing style, beyond topic, shapes embedding dispersion across multilingual language models using a bilingual Queneau-Fénéon corpus augmented with GPT-4o-generated variants to decouple topic and style. It combines clustering, PCA, and UMAP-based dispersion analyses with an interpretability framework to quantify and explain stylistic influences on embeddings. Findings show that topic has a stronger impact on dispersion than style, but style variations significantly modulate dispersion in many models and languages, with French showing stronger stylistic sensitivity than English. These results advance understanding of how stylistic cues are encoded in embeddings and offer guidance for improving model interpretability across languages and genres.
Abstract
This paper analyzes how writing style affects the dispersion of embedding vectors across multiple state-of-the-art language models. While early transformer models were primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By isolating the impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.
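To make the notion of "embedding dispersion" concrete, the sketch below shows one simple way it could be measured: the mean Euclidean distance of a group's vectors to their centroid, computed separately for each topic or style label. This is an illustrative assumption, not the metric used in the paper, which this section does not specify. The function names `dispersion` and `dispersion_by_group` and the toy vectors are hypothetical.

```python
import numpy as np


def dispersion(embeddings: np.ndarray) -> float:
    """Mean Euclidean distance of each vector to the group centroid.

    Assumed definition for illustration only; the paper may use another measure.
    """
    centroid = embeddings.mean(axis=0)
    return float(np.linalg.norm(embeddings - centroid, axis=1).mean())


def dispersion_by_group(embeddings: np.ndarray, labels) -> dict:
    """Compute dispersion separately for each label (e.g. a topic or a style)."""
    labels = np.asarray(labels)
    return {
        str(group): dispersion(embeddings[labels == group])
        for group in np.unique(labels)
    }


# Toy 2-D "embeddings"; real sentence embeddings would have hundreds of dimensions.
toy_embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
toy_styles = ["a", "a", "b", "b"]  # hypothetical style labels

per_style = dispersion_by_group(toy_embeddings, toy_styles)
print(per_style)
```

Grouping the same vectors once by topic label and once by style label, then comparing the resulting values, gives a rough sense of which factor spreads the embeddings more under this assumed measure.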
