Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning

Tan Zhi-Xuan, Lance Ying, Vikash Mansinghka, Joshua B. Tenenbaum

TL;DR

CLIPS presents a Bayesian, language-grounded approach to pragmatic instruction following and goal assistance by treating humans as cooperative planners who communicate joint policies. It combines real-time planning with an LLM-based utterance model to infer goals from actions and language, then selects assistive actions by minimizing expected cost under the inferred posterior. In two domains, CLIPS outperforms baselines (including GPT-4V) in goal accuracy, assistive quality, and alignment with human judgments, demonstrating the value of grounding language in a principled theory of mind and cooperative planning. This work advances robust, uncertainty-aware human-AI collaboration and points to scalable extensions using probabilistic programming and information-gathering strategies for more trustworthy assistive systems.

Abstract

People often give instructions whose meaning is ambiguous without further context, expecting that their actions or goals will disambiguate their intentions. How can we build assistive agents that follow such instructions in a flexible, context-sensitive manner? This paper introduces cooperative language-guided inverse plan search (CLIPS), a Bayesian agent architecture for pragmatic instruction following and goal assistance. Our agent assists a human by modeling them as a cooperative planner who communicates joint plans to the assistant, then performs multimodal Bayesian inference over the human's goal from actions and language, using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan. Given this posterior, our assistant acts to minimize expected goal achievement cost, enabling it to pragmatically follow ambiguous instructions and provide effective assistance even when uncertain about the goal. We evaluate these capabilities in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), finding that CLIPS significantly outperforms GPT-4V, LLM-based literal instruction following and unimodal inverse planning in both accuracy and helpfulness, while closely matching the inferences and assistive judgments provided by human raters.
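The inference and decision steps described in the abstract can be illustrated with a small numerical sketch. It assumes the action and utterance likelihoods have already been computed per goal (in CLIPS these would come from a planner and an LLM; here they are hard-coded), and every goal name, action name, probability, and cost below is hypothetical. It is a toy illustration of multimodal Bayesian goal inference followed by expected-cost minimization, not the authors' implementation.

```python
from math import exp, log

def goal_posterior(prior, action_loglik, utterance_loglik):
    """Posterior over goals given log-likelihoods of observed actions and an utterance."""
    log_joint = {
        g: log(prior[g]) + action_loglik[g] + utterance_loglik[g] for g in prior
    }
    top = max(log_joint.values())  # subtract the max for numerical stability
    weights = {g: exp(v - top) for g, v in log_joint.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def best_assistive_action(posterior, cost):
    """Return the action with the lowest posterior-expected cost, plus all expected costs."""
    expected = {
        a: sum(posterior[g] * per_goal[g] for g in posterior)
        for a, per_goal in cost.items()
    }
    return min(expected, key=expected.get), expected

# Hypothetical toy problem: three candidate gems, two red keys the assistant could fetch.
prior = {"gem_A": 1 / 3, "gem_B": 1 / 3, "gem_C": 1 / 3}
# Assumed likelihood of the human's observed movements under each goal.
action_loglik = {"gem_A": log(0.2), "gem_B": log(0.6), "gem_C": log(0.2)}
# Assumed likelihood of "Can you pass me the red key?" under each goal's plan.
utterance_loglik = {"gem_A": log(0.5), "gem_B": log(0.5), "gem_C": log(0.01)}
# Assumed remaining cost to reach each goal after the assistant takes each action.
cost = {
    "fetch_key_1": {"gem_A": 4, "gem_B": 10, "gem_C": 12},
    "fetch_key_2": {"gem_A": 9, "gem_B": 3, "gem_C": 12},
}

posterior = goal_posterior(prior, action_loglik, utterance_loglik)
action, expected = best_assistive_action(posterior, cost)
print(posterior)
print(action, expected)
```

With these made-up numbers the utterance alone cannot distinguish gem_A from gem_B, but the observed actions favor gem_B, so the lowest expected-cost choice is `fetch_key_2`. This is the sense in which actions disambiguate an instruction that is ambiguous on its own.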

Paper Structure

This paper contains 44 sections, 6 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Overview of cooperative language-guided inverse plan search (CLIPS). We model a human principal as (a) cooperatively planning a joint policy for the human and the (robot) assistant. The human is (b) assumed to take actions from this joint policy while communicating planned actions as an instruction ("Can you pass me the red key?"). Observing this, (c) CLIPS infers the human's goal and policy via Bayesian inverse planning. CLIPS then (d) acts by minimizing expected goal achievement cost, pragmatically interpreting the ambiguous instruction by picking up Key 2. In contrast, a literal instruction follower might pick up Key 1 or Key 4, which are also red in color.
  • Figure 2: Model architecture. In CLIPS, we model the human as a cooperative planner who computes a joint policy $\pi$ for a goal $g \in G$. The policy $\pi$ dictates the human's and assistant's actions $a^{h}_t, a^{r}_t$ at each state $s_t$, as well as the command $c_t$ and utterance $u_t$ that the human may decide (via $d_t$) to communicate at step $t$. One realization of this process is depicted in (a), showing a case where an utterance $u_3$ is made only at $t=3$. We implement this process as a probabilistic program, shown in (b). Utterance generation is modeled by the subroutine in (c), which summarizes salient actions from policy $\pi$ as a command $c_t$, then samples an utterance $u_t$ using a (large) language model prompted with $c_t$ and (d) a list of few-shot examples $\mathcal{E}$ demonstrating how commands are translated into natural language.
  • Figure 3: Example goal assistance problem in VirtualHome, where the principal and assistant collaborate to set the dinner table. The principal places three plates on the table, then says "Could you get the forks and knives?" A pragmatic assistant has to infer the number of forks and knives from context (in this case, three each).
  • Figure 4: Goal assistance problems in Doors, Keys & Gems. Each sub-figure contains a visual (left), instruction (bottom left), goal posteriors produced by each method (top right), and the probability of a key or door appearing in the assistive plans generated by each method (bottom right). Our pragmatic goal assistance method, CLIPS, best matches the goal inferences and assistance options produced by human raters (averaged across raters). In contrast, language-only and action-only inverse planning (Lang. IP and Act. IP) have higher goal uncertainty, the literal baselines fail to resolve instruction ambiguity, and GPT-4V often produces incoherent responses.
  • Figure B1: GPT-4V zero-shot prompt for an example key assistance problem in Doors, Keys & Gems.
  • ...and 4 more figures
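
The generative process described in the Figure 2 caption (goal, joint policy, actions, a decision to speak, a command, and an utterance) can be sketched as a forward sampler. This is a minimal sketch under several assumptions: policies are fixed hand-written action lists rather than the output of a planner, "salient actions" are taken to be the assistant's remaining non-`wait` actions, and a string template stands in for the LLM utterance model. All names (`sample_episode`, `POLICIES`, `TEMPLATES`) and all action strings are hypothetical.

```python
import random

# Hypothetical joint policies: one list of (human_action, assistant_action) per goal.
POLICIES = {
    "gem_A": [
        ("walk to door 1", "pick up key 1"),
        ("wait", "hand over key 1"),
        ("unlock door 1", "wait"),
    ],
    "gem_B": [
        ("walk to door 2", "pick up key 2"),
        ("wait", "hand over key 2"),
        ("unlock door 2", "wait"),
    ],
}
GOALS = list(POLICIES)
# Stand-in for the LLM utterance model: fill a template with the command.
TEMPLATES = ["Can you {}?", "Please {}."]

def sample_episode(goals, policies, p_speak, templates, rng):
    """Sample a goal, then roll out its joint policy with optional utterances."""
    goal = rng.choice(goals)
    policy = policies[goal]
    trace = []
    for t, (a_h, a_r) in enumerate(policy):
        speak = rng.random() < p_speak  # d_t: whether the human communicates at step t
        command, utterance = None, None
        if speak:
            # c_t: assumed here to be the assistant's remaining non-wait actions
            salient = [ar for _, ar in policy[t:] if ar != "wait"]
            if salient:
                command = salient
                # u_t: a template replaces the language model in this sketch
                utterance = rng.choice(templates).format(" and ".join(salient))
        trace.append(
            {"t": t, "human": a_h, "assistant": a_r,
             "command": command, "utterance": utterance}
        )
    return goal, trace

goal, trace = sample_episode(GOALS, POLICIES, 0.5, TEMPLATES, random.Random(0))
print(goal)
for step in trace:
    print(step)
```

Inverse planning runs this kind of model in the other direction: given the observed actions and utterance, it scores how likely each goal and policy would have been to generate them.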