The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt

TL;DR

This work introduces AltPrag, a principled dataset that tests pragmatic competence in LLMs by contrasting two equally plausible but pragmatically distinct continuations. Using an LLM-as-a-judge framework and GPT-4o as the reference annotator, the study tracks how pragmatic understanding emerges across pretraining, supervised fine-tuning, and preference optimization, evaluating a wide range of open-source models. Key findings show that base models already possess non-trivial pragmatic sensitivity, and that both scale and alignment stages (SFT and DPO) yield measurable gains, with cognitive-pragmatic reasoning strengthening earlier and sociopragmatic sensitivity increasing with DPO. The results highlight pragmatics as an emergent, compositional property of LLM training and offer guidance for aligning models with human communicative norms through staged training and robust evaluation. The work also demonstrates the critical role of pretraining data scale and quality in shaping pragmatic abilities, suggesting that foundational data choices can substantially influence downstream communicative behavior.

Abstract

Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker's intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thus directly probing pragmatic competence through contrastive reasoning. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

Paper Structure

This paper contains 33 sections, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Illustration of alternatives. Two appropriate replies to the same question convey different pragmatic forces: the upper is direct and explanatory, the lower playful and implicitly affirmative. We prompt LLMs to interpret the speaker’s intent behind each reply and to articulate the situational motivations for preferring one over the other, thereby isolating pragmatic reasoning by holding the context and literal content constant.
  • Figure 2: An illustration of the data generation process and evaluation workflow. After the majority voting phase, we construct a mirrored version by swapping the order of the two responses and their associated reference labels, resulting in a total of 1,300 data points.
  • Figure 3: Average 10-point quality scores across Base, SFT, and DPO stages for different model families. Significance codes are based on Wilcoxon signed-rank tests comparing each stage with the previous one (e.g., SFT vs. Base, DPO vs. SFT). Asterisks denote statistical significance: * $p < 0.05$, ** $p < 0.01$. Base-stage results are not assigned significance codes as they are used as reference baselines.
  • Figure 4: The Qwen-3 series achieves comparatively higher scores with fewer parameters, illustrating that scaling pretraining data size can enhance a model's capacity for pragmatic reasoning.
  • Figure 5: Distribution of winning explanation categories across selected model comparisons. While both SFT and DPO stages are dominated by cognitive-pragmatic explanations, the DPO stage shows a notable increase in sociopragmatic responses, indicating enhanced sensitivity to social context and appropriateness.
  • ...and 18 more figures
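
The mirroring step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Instance` class, its field names, and the helpers `mirror` and `with_mirrors` are hypothetical. The sketch only shows the one thing the caption states, namely that the two responses and their reference labels are swapped together, so that adding a mirrored copy of each instance doubles the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical schema: field names are assumptions, not from the paper.
    context: str
    response_a: str
    response_b: str
    reference_a: str  # reference label/explanation tied to response_a
    reference_b: str  # reference label/explanation tied to response_b


def mirror(inst: Instance) -> Instance:
    """Swap the order of the two responses and their reference labels."""
    return Instance(
        context=inst.context,
        response_a=inst.response_b,
        response_b=inst.response_a,
        reference_a=inst.reference_b,
        reference_b=inst.reference_a,
    )


def with_mirrors(instances: list[Instance]) -> list[Instance]:
    """Return each instance followed by its mirrored counterpart."""
    out: list[Instance] = []
    for inst in instances:
        out.append(inst)
        out.append(mirror(inst))
    return out


if __name__ == "__main__":
    # Placeholder strings only; no real dataset content is reproduced here.
    example = Instance(
        context="<shared context>",
        response_a="<first reply>",
        response_b="<second reply>",
        reference_a="<reference for first reply>",
        reference_b="<reference for second reply>",
    )
    augmented = with_mirrors([example])
    print(len(augmented))
```

Swapping the labels along with the responses keeps every reference attached to the reply it describes, so a judge that favors whichever response appears first can be detected by comparing scores on an instance and its mirror.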