Simulating User Agents for Embodied Conversational-AI

Daniel Philipov, Vardhan Dongre, Gokhan Tur, Dilek Hakkani-Tür

TL;DR

This work demonstrates the feasibility of a large language model (LLM)-based user agent for assessing and enhancing the effectiveness of robot task completion through natural language communication.

Abstract

Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step the user agent may either "observe" the robot's actions or "speak", whether to intervene in what the robot is doing or to answer its questions. Such a user agent improves the scalability and efficiency of embodied dialogue dataset generation and is critical for enhancing and evaluating the robot's interaction and task completion ability, as well as for research in reinforcement learning using AI feedback. We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero-shot prompting to predict dialogue acts, few-shot prompting, and fine-tuning on the TEACh training subset. Results show the LLM-based user agent achieves an F-measure of 42% with zero-shot prompting and 43.4% with few-shot prompting in mimicking human speaking behavior. Through fine-tuning, performance in deciding when to speak remained stable, while deciding what to say improved from 51.1% to 62.5%. These findings showcase the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.
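The two quantities reported above, an F-measure for deciding when to speak and a score for deciding what to say, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes each time step carries a reference and a predicted decision (`"observe"` or `"speak"`), that `"speak"` is the positive class, and that dialogue-act accuracy is computed only over steps where both the human and the simulated user spoke. The function names and the dialogue-act labels in the usage example are hypothetical.

```python
from typing import Optional, Sequence


def speak_f1(gold: Sequence[str], pred: Sequence[str], positive: str = "speak") -> float:
    """F1 for the 'when to speak' decision, treating `positive` as the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of time steps")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no correct 'speak' predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def dialogue_act_accuracy(
    gold_acts: Sequence[Optional[str]], pred_acts: Sequence[Optional[str]]
) -> float:
    """Accuracy of the 'what to say' decision.

    Assumption: only steps where both sides spoke (both acts are not None) count.
    """
    pairs = [(g, p) for g, p in zip(gold_acts, pred_acts) if g is not None and p is not None]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data, not taken from TEACh.
    gold = ["observe", "speak", "speak", "observe"]
    pred = ["speak", "speak", "observe", "observe"]
    print(speak_f1(gold, pred))

    gold_acts = ["Instruction", None, "Acknowledge"]
    pred_acts = ["Instruction", "Instruction", "Confirm"]
    print(dialogue_act_accuracy(gold_acts, pred_acts))
```

Scoring the two decisions separately mirrors the abstract's distinction: a simulator can match the timing of human utterances well while still choosing the wrong dialogue act, or the reverse.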

Paper Structure

This paper contains 23 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A depiction of the framework that includes a user simulator interacting with an embodied agent to complete a task given as a user goal.
  • Figure 2: A sample session from TEACh dataset. At each step, either the user or the embodied agent takes an action. Images at the bottom show the egocentric views captured by the robot after the action of that time step is executed.
  • Figure 3: Distribution of F-1 Scores across different Dialogue Acts. Robot-only dialogue acts are omitted.
  • Figure 4: Impact of Move Actions on GPT-4 Performance. The figure illustrates the effect of including move actions on the user behavior modeled, in this case, by GPT-4. The results are evaluated using Speak-F1 (blue) and DA Accuracy (green) metrics.
  • Figure 5: Confusion Matrices illustrating GPT-4’s performance in simulating user behavior across four conditions: zero-shot and few-shot learning with and without move actions.