Infer Human's Intentions Before Following Natural Language Instructions

Yanming Wan; Yue Wu; Yiping Wang; Jiayuan Mao; Natasha Jaques

TL;DR

This work empirically demonstrates that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches, and proposes a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), which aims to improve natural language instruction following in collaborative embodied tasks.

Abstract

For AI agents to be helpful to humans, they should be able to follow natural language instructions to complete everyday cooperative tasks in human environments. However, real human instructions inherently possess ambiguity, because the human speakers assume sufficient prior knowledge about their hidden goals and intentions. Standard language grounding and planning methods fail to address such ambiguities because they do not model human internal goals as additional partially observable factors in the environment. We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative embodied tasks. Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps. We implement a set of Transformer-based models and evaluate them over a challenging benchmark, HandMeThat. We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches. We also compare our implementation with strong baselines, including Chain of Thought prompting on the largest available pre-trained language models, and find that FISER provides better performance on the embodied social reasoning tasks under investigation, reaching the state-of-the-art on HandMeThat.

Paper Structure (52 sections, 2 equations, 5 figures, 2 tables)

Introduction
Related Work
Grounded language learning
Collaborative communication
Goal recognition
Reasoning with intermediate steps
FISER: Follow Instructions with Social and Embodied Reasoning
Problem Formulation
Modeling the Human's Intentions
Step-wise Reasoning over Human Intentions
Social Reasoning: Robot's Task Recognition
Social Reasoning: Human's Plan Recognition
Embodied Reasoning: Grounded Planning
Transformer-based Model Implementation
Inputs
...and 37 more sections

Figures (5)

Figure 1: An example scenario where the human's natural language instruction ("Could you pass that from the sofa?") is inherently ambiguous. Standard language grounding and planning methods fail to resolve the ambiguity. We propose FISER, which explicitly reasons about the human's internal intentions as intermediate steps. The robot disambiguates the instruction into a concrete robot-understandable task in the social reasoning phase (Phase 1), and then accomplishes the grounded planning in the embodied reasoning phase (Phase 2). We further propose an optional enhancement to Phase 1: explicitly recognizing the human's overall plan first, and then inferring what the human wants the robot to do.
Figure 2: Graph for our problem formulation and proposed method. White nodes are observable variables, while grey nodes are unobservable. The robot is given the trajectory $\tau_{t'}$, a final state $s_{t'}$, and an utterance $u$. We propose to explicitly model the human's intentions by modeling the human's overall plan $G^h\in\mathcal{G}^h$ as a set of predicates $p_k$. We further assume that the human selects a subgoal $p^*$ that needs help, and then specifies a robot's task $G^r$, which is the human's underlying intention when saying $u$.
Figure 3: The Transformer-based model takes four input parts, which are passed separately into different Transformer Encoder Layers and interact with each other through a Modality Interaction module after each Transformer layer. The first $2N$ layers form the social reasoning phase and the last $N$ layers form the embodied reasoning phase. The embeddings at Layer $2N$ are used for recognizing the robot's task, and the last-layer embeddings are used for predicting actions.
Figure 4: Failure case analysis for state-of-the-art LLMs with CoT prompts following the FISER framework versus Transformer-based models trained from scratch with the FISER framework, over 100 data points on Level 2.
Figure 5: Success rates of GPT-4 Turbo under different settings given that different percentages of irrelevant objects are filtered out from the environment.
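
The two-phase reasoning described in the captions for Figures 1 and 2 can be illustrated with a toy sketch. This is not the paper's Transformer-based implementation: the `Predicate` class, the function names, the counting-based scoring rule, and the example household data are all hypothetical, chosen only to show how recognizing the human's plan first can resolve an utterance that matches several objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical goal predicate: the human wants `obj` to end up at `target`."""
    obj: str
    target: str


def recognize_plan(trajectory, candidate_goals):
    """Phase 1 (optional step): pick the candidate plan G^h that best explains
    the observed trajectory. Toy rule: count predicates already achieved."""
    def score(goal):
        return sum(1 for p in goal if (p.obj, p.target) in trajectory)
    return max(candidate_goals, key=score)


def recognize_task(goal, trajectory, locations, utterance_location):
    """Phase 1: pick an unfinished subgoal p* whose object matches the only
    constraint the utterance gives (here, where the object currently is)."""
    for p in goal:
        done = (p.obj, p.target) in trajectory
        if not done and locations.get(p.obj) == utterance_location:
            return p
    return None


def plan_actions(task, locations):
    """Phase 2: ground the recognized task into primitive robot actions."""
    src = locations[task.obj]
    return [f"move_to({src})", f"pick_up({task.obj})",
            "move_to(human)", f"hand_over({task.obj})"]


# Assumed toy scenario: the human has already put a plate and a fork on the table.
trajectory = {("plate", "table"), ("fork", "table")}
candidate_goals = [
    (Predicate("plate", "table"), Predicate("fork", "table"), Predicate("cup", "table")),
    (Predicate("book", "shelf"), Predicate("magazine", "shelf")),
]
# Both the cup and the magazine are on the sofa, so "pass that from the sofa" is ambiguous.
locations = {"cup": "sofa", "magazine": "sofa", "book": "floor"}

plan = recognize_plan(trajectory, candidate_goals)
task = recognize_task(plan, trajectory, locations, utterance_location="sofa")
actions = plan_actions(task, locations)
print(task.obj, actions)
```

In this sketch the utterance alone cannot distinguish the cup from the magazine; the trajectory makes the table-setting plan the better explanation, which in turn selects the cup as the robot's task before any actions are planned.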
