When Contextual Inference Fails: Cancelability in Interactive Instruction Following

Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz

Abstract

We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm -- which contrasts a pragmatically cooperative speaker with one who is only literally reliable -- we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.
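The trade-off between inferring and asking can be illustrated with a simple expected-score comparison. This is a minimal sketch, not the benchmark's actual scoring rule: the function name `should_ask`, the reward and penalty values, the question cost, and the assumption that a clarification question always resolves the ambiguity are all hypothetical choices made for illustration.

```python
def should_ask(
    p_inference_correct: float,
    reward_correct: float = 1.0,   # hypothetical score for a correct build
    penalty_wrong: float = 0.0,    # hypothetical score for an incorrect build
    question_cost: float = 0.2,    # hypothetical cost of one clarification
) -> bool:
    """Return True if asking has a higher expected score than guessing.

    Assumes (for illustration only) that a clarification question
    always removes the ambiguity, so the build after asking is correct.
    """
    expected_guess = (
        p_inference_correct * reward_correct
        + (1.0 - p_inference_correct) * penalty_wrong
    )
    expected_ask = reward_correct - question_cost
    return expected_ask > expected_guess


# With a speaker whose contextual cues are usually reliable, guessing is better;
# with a speaker whose cues are uninformative, asking is better.
print(should_ask(0.95))  # reliable contextual inference
print(should_ask(0.50))  # unreliable contextual inference
```

Under these assumed values, a builder that tracks speaker reliability would ask only when its estimate of `p_inference_correct` drops below the break-even point, which is the partner-sensitive behavior the benchmark probes for.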

Paper Structure (22 sections, 6 figures, 6 tables)


Figures (6)

  • Figure 1: In the literal speaker condition, the builder receives feedback that the pragmatic inference to continue with the same color has failed.
  • Figure 2: Example of contextual enrichment and cancellation in the Minecraft dialog corpus (Narayan-Chen et al., 2019).
  • Figure 3: An example BWIM interaction. The session starts with a system prompt with a description of the grid and the required answer format, then in each round, the speaker sends building instructions. The model has an option either to respond directly or to ask a clarification question at a cost. After receiving the model's final answer, the speaker sends feedback and updates the score. Thus, the model must balance the risk of building incorrectly and the cost of asking questions in order to maximize the final score.
  • Figure 4: Mean certainty ratings by condition, specification, and source. Bars show mean certainty ratings for the pragmatic and literal speakers across three models and human participants, distinguishing between fully-specified and underspecified trials. Error bars indicate standard error. All models and human participants gave higher ratings to fully-specified trials in both speaker conditions. Underspecified trials were rated higher in the pragmatic condition.
  • Figure 5: Confidence and response change throughout interactions with different speakers. Proportions of pragmatic and non-pragmatic responses are shown for each model and for human participants, by speaker order (Literal Lisa or Pragmatic Pia). Colors indicate response type and confidence rating (1-4 scale). Vertical dashed lines indicate the change of speaker. Each speaker block is divided into 4 time segments. GPT's adaptation effects mirror those of human participants. Gemini shows a strong carry-over of mistrust that was atypical in human participants, and Claude shows no sensitivity to partner type.
  • ...and 1 more figures