Are LLMs good pragmatic speakers?

Mingyue Jian, N. Siddharth

TL;DR

This work investigates whether vanilla LLMs exhibit pragmatic speaker behavior within the Rational Speech Act (RSA) framework, using a reference-game task built from the TUNA corpus. It compares scores from a vanilla LLM (Llama3-8B-Instruct) against RSA scores computed with two variants of the meaning function (MF), prompt-based and rule-based, over both top-k and logic-derived alternative utterances, and quantifies alignment with Pearson and Spearman correlations. The correlations between LLM scores and RSA predictions are positive but inconclusive: alignment is stronger when RSA uses a rule-based MF on logic-derived utterances and weaker on top-k utterances, suggesting that current LLMs do not robustly behave as pragmatic speakers in this setting. The study highlights the need for further work, including human-subject experiments, evaluation of additional models, iterated RSA analyses, and broader domains, to clarify under what conditions LLMs can approximate pragmatic speaker behavior.

Abstract

Large language models (LLMs) are trained on data assumed to include natural language pragmatics, but do they actually behave like pragmatic speakers? We attempt to answer this question using the Rational Speech Act (RSA) framework, which models pragmatic reasoning in human communication. Using the paradigm of a reference game constructed from the TUNA corpus, we score candidate referential utterances under both a state-of-the-art LLM (Llama3-8B-Instruct) and the RSA model, and compare the two sets of scores. Given that RSA requires defining alternative utterances and a truth-conditional meaning function, we explore this comparison for different choices of each of these requirements. We find that while scores from the LLM have some positive correlation with those from RSA, there isn't sufficient evidence to claim that it behaves like a pragmatic speaker. This initial study paves the way for further targeted efforts exploring different models and settings, including human-subject evaluation, to see if LLMs truly can, or can be made to, behave like pragmatic speakers.
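To make the two RSA requirements named above concrete, the following is a minimal sketch of how a pragmatic speaker score can be computed from a set of alternative utterances and a truth-conditional meaning function. It is an illustration under stated assumptions, not the paper's implementation: the three-object world, the function names, the word-matching rule used as the meaning function, the uniform prior over referents, and the absence of an utterance cost term are all hypothetical choices made here for brevity. The parameter `alpha` plays the role of the usual RSA rationality parameter.

```python
def rule_based_meaning(utterance, referent):
    # Hypothetical rule-based meaning function: an utterance is true of a
    # referent if every word matches one of the referent's attribute values.
    return all(word in referent.values() for word in utterance.split())


def literal_listener(meaning, utterances, referents):
    # L0(r | u): uniform prior over referents (an assumption), restricted to
    # the referents for which the utterance is true, then normalised.
    l0 = {}
    for u in utterances:
        truths = [1.0 if meaning(u, r) else 0.0 for r in referents]
        total = sum(truths)
        l0[u] = [t / total if total > 0 else 0.0 for t in truths]
    return l0


def pragmatic_speaker(meaning, utterances, referents, target, alpha=1.0):
    # S1(u | target) is proportional to L0(target | u) ** alpha,
    # normalised over the set of alternative utterances (no cost term).
    l0 = literal_listener(meaning, utterances, referents)
    idx = referents.index(target)
    weights = [l0[u][idx] ** alpha for u in utterances]
    z = sum(weights)
    if z == 0:
        raise ValueError("no alternative utterance is true of the target")
    return {u: w / z for u, w in zip(utterances, weights)}


# Toy world (hypothetical, not taken from the TUNA corpus).
referents = [
    {"type": "chair", "colour": "red"},
    {"type": "chair", "colour": "blue"},
    {"type": "desk", "colour": "red"},
]
utterances = ["chair", "red", "red chair"]

scores = pragmatic_speaker(rule_based_meaning, utterances, referents, referents[0])
for u in utterances:
    print(f"{u!r}: {scores[u]:.2f}")
```

In this toy world only "red chair" singles out the target, so it receives the highest speaker score; changing the set of alternatives or swapping in a different meaning function changes the scores, which is why the comparison is repeated across those choices.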

Paper Structure

This paper contains 15 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Correlation analysis for scores in the LLM and RSA models using (top) prompt-based MF and (bottom) rule-based MF. Red indicates logic-based alternatives, blue indicates the top-$k$ alternatives and grey indicates all alternatives regardless of construction method.
  • Figure 2: Example of the prompt used for generating top-$k$ sequences with the LLM. The blue text indicates variable elements specific to each reference game instance.
  • Figure 3: Project methodology pipeline
  • Figure 4: Example of logical construction process, given the attribute sets in the world.
  • Figure 5: Correlation between LLM and RSA model scores for different $\alpha$ values. Each subplot group shows the overall correlation for scores in the LLM and RSA models using (left) prompt-based MF and (right) rule-based MF. We report the PCC scores for each utterance type.
  • ...and 3 more figures
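
The correlation analyses in the figures above compare two lists of scores over the same candidate utterances. As a rough sketch of what the Pearson correlation coefficient (PCC) and the Spearman rank correlation measure, the snippet below computes both in plain Python. The score values are invented for illustration and are not results from the paper, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (the 1/n factors cancel).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(xs), ranks(ys))


# Invented scores for four candidate utterances (illustration only).
llm_scores = [0.1, 0.4, 0.2, 0.9]
rsa_scores = [0.2, 0.3, 0.25, 0.5]

print("Pearson:", round(pearson(llm_scores, rsa_scores), 3))
print("Spearman:", round(spearman(llm_scores, rsa_scores), 3))
```

Pearson is sensitive to how linearly the two score lists relate, while Spearman depends only on whether they rank the utterances in the same order, so reporting both separates agreement in ranking from agreement in magnitude.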