Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models

Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi

TL;DR

The paper addresses how a robot should begin a human-robot interaction by comparing four system-design patterns that pair vision-language descriptors with large language model-based policies. The patterns are evaluated with GPT-4o and Phi-3 Vision on a test set of 84 situations drawn from JPL, EgoCom, and Stanford, including true/false and open-ended scenarios that probe how the models balance human and robot priorities at the onset of an interaction. The results show that multimodal descriptor approaches can produce appropriate actions when the desired behavior is clear, with LLMs providing detailed verbal guidance and VLMs aiding perception, while open-ended situations still expose difficulty in balancing competing priorities. The work offers practical design guidelines for real-time HRI and identifies where current LLM/VLM capabilities need refinement for robust, open-ended initiation.

Abstract

In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human depends on several situational factors (e.g., the human's current activity and the urgency of the interaction). We test whether large language models (LLMs) and vision language models (VLMs) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and evaluate them on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear; however, challenges remain in open-ended situations, where the model must balance the human's situation against the robot's.

Paper Structure

This paper contains 13 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: System for beginning a human-robot interaction. Different design patterns for the descriptor and policy are tested and compared in the experiments.
  • Figure 2: Overview of the test set with example videos/images.
  • Figure 3: Summary of findings from results. Parentheses indicate descriptor type. The VLM policy and the LLM policy with the combined descriptor generate the most appropriate actions in non-open-ended situations. In open-ended situations, however, the VLM policy prioritizes the robot situation while the LLM policy with the combined descriptor prioritizes the human situation, so both have difficulty balancing situational priority.
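
The descriptor-and-policy split named in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the `Situation` fields, the action labels `"wait"` and `"speak"`, and both stub functions are hypothetical stand-ins, with simple keyword rules used in place of actual VLM and LLM calls. It only shows how a descriptor (scene to text) and a policy (text plus robot situation to action) might be composed so that different design patterns can be swapped in and compared.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Situation:
    """Hypothetical container for one test situation (not from the paper)."""
    human_scene: str  # text stand-in for the video/image of the human
    robot_task: str   # what the robot wants to communicate


# A descriptor maps an observation of the human to a text description.
Descriptor = Callable[[str], str]
# A policy maps (description, robot situation) to an action label.
Policy = Callable[[str, str], str]


def keyword_descriptor(scene: str) -> str:
    """Stub standing in for a VLM-based descriptor (assumption)."""
    busy_words = ("phone", "talking", "meeting")
    is_busy = any(word in scene.lower() for word in busy_words)
    return f"The human appears {'busy' if is_busy else 'available'}."


def rule_policy(description: str, robot_task: str) -> str:
    """Stub standing in for an LLM-based policy (assumption)."""
    urgent = "urgent" in robot_task.lower()
    # Defer to the human only when they look busy and the task can wait.
    if "busy" in description and not urgent:
        return "wait"
    return "speak"


def decide(situation: Situation, descriptor: Descriptor, policy: Policy) -> str:
    """Compose a descriptor and a policy into one decision step."""
    description = descriptor(situation.human_scene)
    return policy(description, situation.robot_task)


if __name__ == "__main__":
    example = Situation("A person talking on the phone", "deliver a routine reminder")
    print(decide(example, keyword_descriptor, rule_policy))
```

Because `decide` takes the descriptor and policy as arguments, a different pairing (for example, a single model acting as both) could be evaluated on the same situations without changing the calling code.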