PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities

Settaluri Lakshmi Sravanthi, Meet Doshi, Tankala Pavan Kalyan, Rudra Murthy, Pushpak Bhattacharyya, Raj Dabre

TL;DR

This work introduces PUB, a unified Pragmatics Understanding Benchmark that evaluates LLM pragmatic reasoning across four phenomena (Implicature, Presupposition, Deixis, Reference) via 14 MCQA tasks totaling 28k items. It systematically compares six model families (including Llama-2, T5, Flan-T5, and GPT-3.5) using two prompting approaches and a consistency metric (PPA), with extensive human evaluation. Findings show that instruction tuning benefits small models but has limited impact on large ones, that humans consistently outperform models, and that model performance varies with the task. PUB offers a path to diagnosing and advancing pragmatic capabilities in LLMs with realistic, linguistically grounded tasks and evaluation protocols.

Abstract

LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, of which 6.1k were created by us and the rest were adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.

Paper Structure (15 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Average performance of models on three different pragmatics phenomena. Average accuracies for Reference and Deixis are merged and plotted as Reference because they are closely related phenomena. Human - I, P, and R represent the performance of human evaluators on Implicature, Presupposition, and Reference, respectively.
  • Figure 2: Examples of each task from PUB. The tasks are divided across four domains of pragmatics (Implicature, Presupposition, Reference, and Deixis). Our proposed benchmark builds upon existing pragmatic datasets and combines them with our newly annotated datasets, comprising 6k annotations, to complete the pragmatic evaluation test suite with 28k examples. We have reformatted the existing datasets into MCQA prompts that explicitly test these abilities.
  • Figure 3: Comparison of various models' multiple choice symbol binding using PPA. Results are averaged across Tasks 4, 11, and 14, representing different pragmatic domains.
  • Figure 4: Results (accuracy) for tasks 2 & 3, tasks 5 & 6, and tasks 7, 8 & 9. The reported results are the maximum across all types of evaluations (0-shot and 3-shot Cloze and MCQA) performed on the models.
  • Figure 5: Confusion matrices comparing ground truth (GT) with LLM predictions and with human answers, revealing LLMs' tendency to misclassify positive labels as negatives.
  • ...and 1 more figure
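
The TL;DR and the Figure 3 caption refer to PPA as a consistency metric without defining it. The following is a minimal sketch, assuming PPA stands for Proportion of Plurality Agreement as used in prior MCQA work: each question is shown to the model under several orderings of its answer options, and the score is the fraction of orderings on which the model picks its own most frequent answer. The function names and the input layout are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def ppa_for_question(chosen_answers: Sequence[str]) -> float:
    """Score one question (assumed PPA definition, see lead-in).

    chosen_answers holds the answer *content* the model selected under each
    option ordering (not the option letter, which changes with the ordering).
    """
    if not chosen_answers:
        raise ValueError("need at least one prediction per question")
    counts = Counter(chosen_answers)
    # Size of the plurality group: how often the most frequent answer was chosen.
    plurality_count = counts.most_common(1)[0][1]
    return plurality_count / len(chosen_answers)


def mean_ppa(per_question_answers: Sequence[Sequence[str]]) -> float:
    """Average the per-question scores over a task."""
    if not per_question_answers:
        raise ValueError("need at least one question")
    scores = [ppa_for_question(answers) for answers in per_question_answers]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical predictions for two questions under four option orderings each.
    predictions = [
        ["yes", "yes", "no", "yes"],   # model is mostly consistent
        ["yes", "no", "yes", "no"],    # model flips with the ordering
    ]
    print(mean_ppa(predictions))
```

Under this reading, a model that always selects the same answer content regardless of option order scores 1.0, so a low value signals sensitivity to option position rather than to meaning.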