Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models

Benjamin Icard, Evangelia Zve, Lila Sainero, Alice Breton, Jean-Gabriel Ganascia

TL;DR

The paper investigates how writing style, beyond topic, shapes embedding dispersion across multilingual language models using a bilingual Queneau-Fénéon corpus augmented with GPT-4o-generated variants to decouple topic and style. It combines clustering, PCA, and UMAP-based dispersion analyses with an interpretability framework to quantify and explain stylistic influences on embeddings. Findings show that topic has a stronger impact on dispersion than style, but style variations significantly modulate dispersion in many models and languages, with French showing stronger stylistic sensitivity than English. These results advance understanding of how stylistic cues are encoded in embeddings and offer guidance for improving model interpretability across languages and genres.

Abstract

This paper analyzes how writing style affects the dispersion of embedding vectors across multiple state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: French and English $\texttt{GPT-4o}$ prompts used for generating $\textsc{Queneau\_gen}$ and $\textsc{Feneon\_gen}$, based on $\textsc{Queneau\_ref}$ and $\textsc{Feneon\_ref}$.
  • Figure 2: 2D PCA projection of the 4 clusters obtained with $\texttt{mistral-embed}$ on the Queneau-Feneon corpus and distribution of texts per cluster, for French (a) and for English (b).
  • Figure 3: 2D UMAP contour plots of the embedding dispersion obtained on the $\textsc{Queneau-Feneon}$ corpus with model $\texttt{all-MiniLM-L12-v2}$, for French (left) and for English (right). In each subplot, the overall spread of the embeddings around the centroid (for the last seed) is represented by the external contour line, the isolines represent differences in the density of embedding vectors, the centroid is indicated by a dot, and $\bar{d}_X$ corresponds to the mean centroid distance for the targeted corpus $X$.
  • Figure 4: Correlation matrices between differences in dispersion ($\Delta d$) and differences in frequencies of the eight stylistic features ($\Delta f^{s}$) for the two comparisons of interest, for French (top) and English (bottom). Here "-" marks correlations that were intentionally omitted because they involve feature differences previously found to be non-significant (see Table \ref{tab:groundavg}).
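
The captions of Figures 3 and 4 refer to a mean centroid distance $\bar{d}_X$ and to a difference in dispersion $\Delta d$ between corpora. The sketch below shows one plausible way to compute such quantities, under stated assumptions: the distance is taken to be Euclidean (the summary above does not specify the metric), the toy 2D vectors stand in for real embeddings, and the function names `centroid` and `mean_centroid_distance` are hypothetical, not taken from the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def mean_centroid_distance(vectors):
    """Average distance from each vector to the centroid of its set.

    Assumption: Euclidean distance; the paper may use a different metric.
    """
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)

# Toy stand-ins for the embeddings of two corpora (not real data).
corpus_a = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
corpus_b = [[3.0 * x, 3.0 * y] for x, y in corpus_a]  # same shape, wider spread

d_a = mean_centroid_distance(corpus_a)
d_b = mean_centroid_distance(corpus_b)
delta_d = d_b - d_a  # positive when corpus_b is more dispersed than corpus_a

print(f"d_a={d_a:.4f}  d_b={d_b:.4f}  delta_d={delta_d:.4f}")
```

With real data, each inner list would be replaced by an embedding vector produced by one of the models under study, and a larger $\bar{d}_X$ would indicate that the texts of corpus $X$ are more spread out around their centroid.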