
IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation

Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos, David Chiang, Yulia Tsvetkov, Graham Neubig

Abstract

Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence's provenance with linguistic features of a sentence's content, to learn a continuous representation of each sentence's style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.

Paper Structure

This paper contains 67 sections, 11 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: IdioleX used to compare the idiolectal alignment between user input and GPT 5.1 generations in casual Argentinian Spanish.
  • Figure 1: DID F1-score and exact match accuracy (EM) results on Spanish (DSL-ML) and Arabic (MADAR 26). External baselines are: the top scoring submission to DSL-ML, the top scoring submission to MADAR 26, and the top scoring neural submission to MADAR 26.
  • Figure 2: IdioleX training framework. During training, all batches are sampled such that every individual item can act as an anchor for contrastive learning, necessitating that there are $2^{3-n}$ samples for each proximity score $n \in [0,3]$.
  • Figure 3: Non-fine-tuned IdioleX classification model's likelihood distribution over samples on multi-label Spanish DID.
  • Figure 4: IdioleX-based stylistic similarity scores of a random sample of the withheld test set of the Reddit corpus plotted against Multilingual-E5 semantic similarity scores (wang2024multilingual). We see a low, but positive correlation.
  • ...and 1 more figure
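The batch-construction rule in the Figure 2 caption (drawing $2^{3-n}$ samples for each proximity score $n \in [0,3]$, so every item can serve as a contrastive anchor) can be sketched as follows. This is a minimal illustration of the stated counting rule only; the function name and the idea of assembling these counts into a batch are assumptions, not the authors' implementation.

```python
def batch_composition(max_score: int = 3) -> dict[int, int]:
    """Number of samples drawn per proximity score n in [0, max_score],
    following the 2**(max_score - n) rule from the Figure 2 caption."""
    return {n: 2 ** (max_score - n) for n in range(max_score + 1)}

sizes = batch_composition()
print(sizes)                 # {0: 8, 1: 4, 2: 2, 3: 1}
print(sum(sizes.values()))   # 15 samples per batch under this rule
```

Under this rule, items with lower proximity scores (more distant in style) are sampled more heavily, giving each anchor a graded set of positives and negatives within a single batch.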