Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning
Tan Zhi-Xuan, Lance Ying, Vikash Mansinghka, Joshua B. Tenenbaum
TL;DR
CLIPS is a Bayesian, language-grounded approach to pragmatic instruction following and goal assistance that treats humans as cooperative planners who communicate joint policies. It combines real-time planning with an LLM-based utterance model to infer goals from actions and language, then selects assistive actions that minimize expected cost under the inferred goal posterior. In two domains, CLIPS outperforms baselines (including GPT-4V) in goal accuracy and assistive quality, and aligns more closely with human judgments, which illustrates the value of grounding language in a principled theory of mind and cooperative planning. The work also points to scalable extensions using probabilistic programming and information-gathering strategies.
Abstract
People often give instructions whose meaning is ambiguous without further context, expecting that their actions or goals will disambiguate their intentions. How can we build assistive agents that follow such instructions in a flexible, context-sensitive manner? This paper introduces cooperative language-guided inverse plan search (CLIPS), a Bayesian agent architecture for pragmatic instruction following and goal assistance. Our agent assists a human by modeling them as a cooperative planner who communicates joint plans to the assistant, then performs multimodal Bayesian inference over the human's goal from actions and language, using large language models (LLMs) to evaluate the likelihood of an instruction given a hypothesized plan. Given this posterior, our assistant acts to minimize expected goal achievement cost, enabling it to pragmatically follow ambiguous instructions and provide effective assistance even when uncertain about the goal. We evaluate these capabilities in two cooperative planning domains (Doors, Keys & Gems and VirtualHome), finding that CLIPS significantly outperforms GPT-4V, LLM-based literal instruction following, and unimodal inverse planning in both accuracy and helpfulness, while closely matching the inferences and assistive judgments provided by human raters.
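The inference-then-assist loop described above can be illustrated with a minimal Python sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the goal names, likelihood values, and cost table are hypothetical, and the fixed utterance likelihoods stand in for the scores that an LLM would assign to an instruction given a hypothesized plan. The functions `goal_posterior` and `best_assistive_action` are names introduced here for illustration only.

```python
from typing import Dict


def goal_posterior(
    prior: Dict[str, float],
    action_lik: Dict[str, float],
    utterance_lik: Dict[str, float],
) -> Dict[str, float]:
    """Bayes' rule over goals: P(g | actions, utterance) is proportional to
    P(g) * P(actions | g) * P(utterance | g)."""
    unnorm = {g: prior[g] * action_lik[g] * utterance_lik[g] for g in prior}
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("all goals have zero probability under the evidence")
    return {g: p / z for g, p in unnorm.items()}


def best_assistive_action(
    posterior: Dict[str, float],
    cost: Dict[str, Dict[str, float]],
) -> str:
    """Pick the action with the lowest expected goal achievement cost.
    cost[action][goal] is the (hypothetical) cost of reaching `goal`
    if the assistant takes `action`."""

    def expected_cost(action: str) -> float:
        return sum(posterior[g] * cost[action][g] for g in posterior)

    return min(cost, key=expected_cost)


# Hypothetical scenario: the human walks toward a red door and says
# "Can you get the key?", which is ambiguous between two keys.
prior = {"red_gem": 0.5, "blue_gem": 0.5}
action_lik = {"red_gem": 0.8, "blue_gem": 0.2}     # movement favors red (assumed values)
utterance_lik = {"red_gem": 0.5, "blue_gem": 0.5}  # stand-in for LLM scores (assumed values)

posterior = goal_posterior(prior, action_lik, utterance_lik)

cost = {
    "fetch_red_key": {"red_gem": 2.0, "blue_gem": 10.0},
    "fetch_blue_key": {"red_gem": 10.0, "blue_gem": 2.0},
}
choice = best_assistive_action(posterior, cost)
print(posterior, choice)
```

In this toy case the instruction alone does not discriminate between goals, so the observed actions carry the disambiguation, and the assistant fetches the key that is cheapest in expectation under the resulting posterior. The full system described in the abstract infers over plans and uses an LLM for the utterance likelihood rather than a lookup table.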
