Measuring Intent Comprehension in LLMs

Nadav Kunievsky, James A. Evans

Abstract

People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type. Yet LLMs are trained to predict the next token solely from text input, not underlying intent. Because written language is an imperfect proxy for intent, and correlations between phrasing and desired outcomes can break down in training data, models that rely too heavily on surface cues may respond inconsistently to semantically equivalent prompts. This makes it essential to evaluate whether LLMs can reliably infer user intent, especially in high-stakes settings where robustness and generalization are critical. We introduce a formal framework for assessing intent comprehension in LLMs: whether a model demonstrates robust understanding of user intent by producing consistent outputs across semantically equivalent prompts while differentiating between prompts with distinct intents. Our evaluation approach is based on a variance decomposition of model responses into three components: variability due to user intent, user articulation, and model uncertainty. Models that understand what users want, and are not overly sensitive to textual cues, should attribute most output variance to intent differences rather than articulation style. Applying this framework across diverse domains, we find that, within the five LLaMA and Gemma models we evaluate, larger models typically assign a greater share of variance to intent, indicating stronger comprehension of intent, although gains are uneven and often modest with increasing model size. These results motivate moving beyond accuracy-only benchmarks toward semantic diagnostics that directly assess whether models understand what users intend.
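
The three-way split described in the abstract can be illustrated with the law of total variance on a nested design. The sketch below is not the authors' estimator. It assumes that every response has already been mapped to a scalar score by some evaluator, that the design is balanced (the same number of articulations per intent and samples per articulation), and it uses a hypothetical function name, `decompose_variance`.

```python
from statistics import fmean


def decompose_variance(responses):
    """Split total score variance into intent, articulation, and model parts.

    responses[i][a] is a list of scalar scores sampled from the model for
    intent i phrased with articulation a. Assumes a balanced design, so that
    means of means equal the overall mean (an assumption of this sketch).
    """
    # Mean score for each (intent, articulation) cell.
    cell_means = [[fmean(samples) for samples in prompts] for prompts in responses]
    # Mean score for each intent, averaged over its articulations.
    intent_means = [fmean(row) for row in cell_means]
    grand_mean = fmean(intent_means)

    # Variance of intent-level means: variability due to user intent.
    intent_var = fmean((m - grand_mean) ** 2 for m in intent_means)
    # Spread of cell means around their intent mean: user articulation.
    articulation_var = fmean(
        (cell - intent_means[i]) ** 2
        for i, row in enumerate(cell_means)
        for cell in row
    )
    # Spread of samples around their cell mean: model uncertainty.
    model_var = fmean(
        (y - cell_means[i][a]) ** 2
        for i, prompts in enumerate(responses)
        for a, samples in enumerate(prompts)
        for y in samples
    )
    return {
        "intent": intent_var,
        "articulation": articulation_var,
        "model": model_var,
        "total": intent_var + articulation_var + model_var,
    }


# Toy data: 2 intents x 2 articulations x 2 samples (made-up scores).
toy = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[4.0, 6.0], [6.0, 8.0]],
]
parts = decompose_variance(toy)
intent_share = parts["intent"] / parts["total"]
```

On this toy input the intent component dominates the total. In the abstract's terms, a larger intent share would suggest the model responds to what is asked rather than how it is phrased.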

Paper Structure

This paper contains 42 sections, 21 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Conceptual illustration of our decomposition measures
  • Figure 2: Average variance decomposition across models; the shaded area is the 95% CI
  • Figure 3: ICI, MVS and IS across models; the shaded area is the 95% CI
  • Figure 4: Variance decomposition across the different topics, with black bars indicating 95% confidence intervals from a bootstrap with 100 repetitions
  • Figure 5: Loan Approval - Information Theoretic Decomposition of Intent Comprehension
  • ...and 6 more figures

Theorems & Definitions (6)

  • Definition 2.1: Intent Comprehension
  • Remark 2.2
  • Remark 2.3
  • Definition 2.4: Sufficient Intent Comprehension with respect to an evaluator $\tilde{\tau}$ and $V$
  • Remark 2.5
  • Remark 3.1