Plan-and-Act using Large Language Models for Interactive Agreement

Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi

TL;DR

The paper addresses runtime action planning in human-robot interaction by leveraging large language models to generate plans on the fly. It introduces a plan-and-act skill that combines a bottom-up action set, an event-driven timing manager, and action-text inputs to the LLM, enabling balanced behavior between respecting human activity and pursuing robot goals. The Engage skill demonstrates the approach across four scenarios, achieving about 90% success, with second-stage timing and action-text guidance proving critical for consistency and responsiveness. This work suggests a scalable, generalizable framework for LLM-assisted runtime planning in HRI, reducing manual heuristics while raising questions about reliance versus guidance and future trajectory-level integrations.

Abstract

Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when to use the LLM to generate an action plan. In this paper, we propose a necessary plan-and-act skill design to solve the above problems. We show that a critical factor for enabling a robot to switch between passive / active interaction behavior is to provide the LLM with an action text about the current robot's action. We also show that a second-stage question to the LLM (about the next timing to call the LLM) is necessary for planning actions at an appropriate timing. The skill design is applied to an Engage skill and is tested on four distinct interaction scenarios. We show that by using the skill design, LLMs can be leveraged to easily scale to different HRI scenarios with a reasonable success rate reaching 90% on the test scenarios.

Paper Structure

This paper contains 12 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: A situation where a robot is trying to execute a "speak" action but should re-plan its actions based on the runtime HRI situation.
  • Figure 2: Proposed skill design. The action in blue indicates the current action. The action in orange indicates the next action decided using the LLM and event manager. The skill begins from the starting action and changes the action based on the LLM's response until reaching the end action. Timing to call the LLM is managed by the event manager, which uses situational changes from the recognition modules and second-stage questions to decide the timing.
  • Figure 3: Bottom-up approach for registering new actions in the action set. The figure shows an example of creating a new action "eye contact" based on the response from the LLM, existing actions, and the NLP component.
  • Figure 4: Experiment results for the four test scenarios: (1) person-robot (a seated person waiting to talk with the robot), (2) person-object (a person talking on a phone when the robot arrives), (3) person-environment (a person walking by without noticing the robot), (4) person-person (a person in a conversation who then walks away). Each timeline runs from start to end, left to right, with the duration of the scenario at the far right of the line. Engage Label refers to the time sections of the ground-truth labels. Human Activity refers to the activity description published by the simulator. Gaze at Robot indicates the time sections where a "looking toward" description is published by the simulator (otherwise "looking away" is published). LLM Trigger indicates the times at which the event manager calls the LLM component, where 1 indicates a call for the first-stage action decision and 2 indicates a call for the second-stage timing decision. Y/N refers to the yes/no response to the "chance of losing the person" question, and $T$[s] refers to the returned wait time, where inf indicates an infinite wait. 1* indicates times at which the LLM was called after the wait time had elapsed. An action changing without a call to the LLM indicates that the previous call returned a sequence of actions (e.g., "The robot should stop, make eye contact, and respond with a friendly phrase.").
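
The calling pattern described in the captions for Figures 2 and 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `EventManager`, the stub functions, the situation strings, and the action names are all hypothetical, and the two LLM questions are replaced by hard-coded rules so the example runs offline. The sketch assumes that the first-stage call returns a sequence of actions, that the second-stage call returns only a wait time in seconds (the yes/no "chance of losing the person" question is omitted), and that the LLM is called again when the situation text changes or the wait time elapses.

```python
import math
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for the two LLM questions (not a real LLM API).
# First stage: (current action text, situation text) -> sequence of next actions.
FirstStage = Callable[[str, str], List[str]]
# Second stage: (current action text, situation text) -> seconds until the next call.
SecondStage = Callable[[str, str], float]


class EventManager:
    """Decides when to call the (stubbed) LLM and steps through returned actions."""

    def __init__(self, first_stage: FirstStage, second_stage: SecondStage,
                 end_action: str = "end") -> None:
        self.first_stage = first_stage
        self.second_stage = second_stage
        self.end_action = end_action

    def run(self, observations: List[Tuple[float, str]],
            start_action: str = "approach") -> List[Tuple[float, str, str]]:
        """Return a trace of (time, trigger, action); trigger is "1", "1*", or ""."""
        action = start_action
        pending: List[str] = []          # actions returned by the last first-stage call
        last_situation: Optional[str] = None
        deadline = math.inf              # time at which the wait expires
        trace = []
        for t, situation in observations:
            changed = situation != last_situation
            timed_out = t >= deadline
            trigger = ""
            if changed or timed_out:
                # "1": call caused by a situational change; "1*": call after the wait.
                trigger = "1" if changed else "1*"
                pending = list(self.first_stage(action, situation))
                deadline = t + self.second_stage(action, situation)
                last_situation = situation
            if pending:
                # Without a new call, keep consuming the previously planned sequence.
                action = pending.pop(0)
            trace.append((t, trigger, action))
            if action == self.end_action:
                break
        return trace


def stub_first_stage(action_text: str, situation: str) -> List[str]:
    # Toy rules in place of an LLM; they only inspect the situation text.
    if "phone" in situation:
        return ["wait"]                  # respect the person's current activity
    if "looking toward" in situation:
        return ["stop", "eye contact", "speak", "end"]
    return ["approach"]


def stub_second_stage(action_text: str, situation: str) -> float:
    # Toy rule: re-check a busy person after 5 s, otherwise wait indefinitely.
    return 5.0 if "phone" in situation else math.inf


# Hypothetical scenario: a person on the phone who later looks toward the robot.
demo_observations = (
    [(float(t), "talking on a phone, looking away") for t in range(6)]
    + [(float(t), "idle, looking toward") for t in range(6, 10)]
)

if __name__ == "__main__":
    for row in EventManager(stub_first_stage, stub_second_stage).run(demo_observations):
        print(row)
```

In this toy run the stubbed LLM is called at the start, again once the 5 s wait expires, and once more when the situation text changes. The remaining actions change without further calls because the last call returned a sequence, which mirrors the behavior noted at the end of the Figure 4 caption.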