Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Non-Literal Intent Resolution in LLMs

Akhila Yerukola, Saujas Vaduguru, Daniel Fried, Maarten Sap

TL;DR

The paper addresses how LLMs understand and respond to non-literal language by shifting from discriminative intent detection to generative pragmatic evaluation. It introduces a formal framework that takes a context $C$ and a non-literal utterance $U_1^N$, builds two reference dialog paths based on the true intent $I_T$ and the literal intent $I_L$, and compares the model's response $U_2^N$ to the two reference responses using a similarity metric: the response counts as pragmatically appropriate when $sim(U_2^N, U_2^T) > sim(U_2^N, U_2^L)$. Across five open-source LLMs, the study reports average pragmatic-generation accuracy of roughly 50–55%, with chain-of-thought prompting giving modest gains and oracle-intention cues achieving up to about 75% accuracy for some models (e.g., Mistral-Instruct); these results reveal a substantial gap in pragmatic understanding. Importantly, discriminative intention detection proves easier than generating pragmatically appropriate responses, signaling that detection and generation rely on different capabilities and should be evaluated separately. The work also demonstrates that providing explicit intention cues, including phenomenon type, substantially improves performance, while highlighting limitations such as restricted context and a narrow scope of phenomena, which motivate future research in explicit intention modeling and robust pragmatic generation. The framework has practical implications for building more capable conversational agents that can understand and act on non-literal communication in naturalistic interactions.
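
The decision rule $sim(U_2^N, U_2^T) > sim(U_2^N, U_2^L)$ can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the summary does not say how $sim$ is computed, so a toy bag-of-words cosine similarity stands in for it, and the names `bow_cosine`, `is_pragmatic`, and `pragmatic_accuracy` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Tuple

Similarity = Callable[[str, str], float]


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for sim(.,.): cosine similarity over whitespace tokens.

    ASSUMPTION: the real metric is not specified in this summary.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_pragmatic(model_resp: str, true_ref: str, literal_ref: str,
                 sim: Similarity = bow_cosine) -> bool:
    """True when U_2^N is closer to U_2^T (true intent) than to U_2^L (literal)."""
    return sim(model_resp, true_ref) > sim(model_resp, literal_ref)


def pragmatic_accuracy(examples: Iterable[Tuple[str, str, str]],
                       sim: Similarity = bow_cosine) -> float:
    """Fraction of (U_2^N, U_2^T, U_2^L) triples judged pragmatic."""
    judged = [is_pragmatic(n, t, l, sim) for n, t, l in examples]
    if not judged:
        raise ValueError("need at least one example")
    return sum(judged) / len(judged)


if __name__ == "__main__":
    # Hypothetical responses to an indirect request such as "It is cold in here."
    triples = [
        ("sure i will close the window",
         "okay i will close the window for you",
         "yes it is quite cold in here today"),
        ("yes it is cold",
         "okay i will close the window",
         "yes it is quite cold in here"),
    ]
    print(pragmatic_accuracy(triples))
```

Passing `sim` as a parameter keeps the comparison rule separate from the choice of metric, so any other scorer with the same two-string signature can be swapped in without touching the accuracy computation.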

Abstract

Humans often express their communicative intents indirectly or non-literally, which requires their interlocutors -- human or AI -- to understand beyond the literal meaning of words. While most existing work has focused on discriminative evaluations, we present a new approach to generatively evaluate large language models' (LLMs') intention understanding by examining their responses to non-literal utterances. Ideally, an LLM should respond in line with the true intention of a non-literal utterance, not its literal interpretation. Our findings show that LLMs struggle to generate pragmatically relevant responses to non-literal language, achieving only 50-55% accuracy on average. While explicitly providing oracle intentions significantly improves performance (e.g., 75% for Mistral-Instruct), this still indicates challenges in leveraging given intentions to produce appropriate responses. Using chain-of-thought to make models spell out intentions yields much smaller gains (60% for Mistral-Instruct). These findings suggest that LLMs are not yet effective pragmatic interlocutors, highlighting the need for better approaches for modeling intentions and utilizing them for pragmatic generation.

Paper Structure (35 sections, 5 figures)

Figures (5)

  • Figure 1: Framework to evaluate whether an LLM can generate an appropriate response to non-literal language use. Given a context $C$ and a non-literal utterance $U_1^N$, the model responds with $U_2^N$. Our proposed framework compares $U_2^N$ against responses ($U_2^L$ and $U_2^T$) from two counterfactual dialog chains based on conveying incorrect literal meaning $I_L$ and direct true intent $I_T$. We then compare the similarity of the model generated response $U_2^N$ to these reference responses, under the context $C$, to determine whether it is appropriate.
  • Figure 2: Comparison between intention resolution in response generation vs intention detection by LLMs. On average, LLMs find the generative setting harder than the discriminative setting for non-literal language use.
  • Figure 3: Results from experiments with CoT prompting show that performance is highest when providing oracle true intention, and lowest with no oracle information.
  • Figure 4: Positive correlation between inferred intention accuracy and pragmatic response accuracy.
  • Figure 5: Chain-of-thought prompting templates used in the chain-of-thought (CoT) experiments. Orange highlighted text is the explicitly provided oracle information.