Do language models accommodate their users? A study of linguistic convergence

Terra Blevins, Susanne Schmalwieser, Benjamin Roth

TL;DR

The paper investigates whether language models linguistically converge to their users by grounding model completions in existing human dialogues. Using a synthetic paradigm across 16 models, three dialogue corpora, and multiple stylometric features, it shows that LLMs exhibit strong convergence to context, frequently surpassing random baselines and, in many cases, overconverging relative to human utterances. Convergence patterns vary by model family, training regime, and dataset, with pretrained models generally showing greater adaptation than instruction-tuned ones. The findings suggest that model convergence arises from pretraining dynamics rather than social goals, carrying implications for how we evaluate and deploy conversational AI and highlighting the need for user studies to understand perceptual effects on trust and interaction quality.

Abstract

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation's style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.
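As a rough illustration of the comparison described in the abstract, the sketch below scores how closely a response matches its dialogue context and contrasts the model completion with the original human response. The Jaccard word overlap used here is an assumed stand-in, not one of the paper's stylometric features, and the names `lexical_overlap` and `convergence_gap` are hypothetical.

```python
from statistics import mean


def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def lexical_overlap(context, response):
    """Jaccard overlap between the context and response vocabularies.

    Assumption: this toy score stands in for a stylometric feature;
    it is not a metric taken from the paper.
    """
    c, r = tokens(context), tokens(response)
    if not c and not r:
        return 0.0
    return len(c & r) / len(c | r)


def convergence_gap(dialogues):
    """Mean model overlap minus mean human overlap.

    `dialogues` is a list of (context, human_response, model_response)
    triples. A positive value means the model completions match the
    context more closely than the human responses do under this measure.
    """
    human = mean(lexical_overlap(c, h) for c, h, _ in dialogues)
    model = mean(lexical_overlap(c, m) for c, _, m in dialogues)
    return model - human
```

Under this toy measure, a positive gap would correspond to what the abstract calls converging more strongly than the human baseline; a real analysis would also need a random baseline and a significance test.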

Paper Structure

This paper contains 19 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of the human and random baselines on each metric across datasets. Metrics marked with $\uparrow$ indicate more agreement with higher values; and $\downarrow$, vice-versa.
  • Figure 2: Scatter plot of Gemma and Llama Model scores on various convergence metrics relative to human and random baselines on DailyDialog (top row), Movie corpus (middle), and NPR (bottom), across model sizes (Billion parameters). PT indicates pretrained checkpoints while IT are instruction-tuned. Metrics marked with $\uparrow$ indicate more agreement with higher values; and $\downarrow$, vice-versa.
  • Figure 3: Summary of model convergence relative to the human and random baselines for individual LIWC word classes on the DailyDialog dataset. Pink cells indicate classes where the model significantly ($p<0.05$) overconverges relative to the baseline, while green cells indicate significant underconvergence. Gray cells are not significantly different from the baseline.
  • Figure 4: Stepwise analysis of convergence in LM generations (and human ground truth utterances) for DailyDialog, measuring the agreement between each utterance $r_{t=n}$ and the preceding utterances $r_{t=1,...,n-1}$ on our four metrics. Timesteps in gray ($t=2,4$) indicate the prior turns in the role the model adopts, $S_y$, while white timesteps are utterances from the other speaker, $S_x$. Each line reports the averaged score across all model sizes in a given family.
  • Figure 5: Summary of model convergence relative to human and random baselines on LIWC word classes for DailyDialog.
  • ...and 3 more figures