Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

Hassan Ali, Philipp Allgeuer, Stefan Wermter

TL;DR

The paper addresses human intention prediction in a collaborative object categorization task with a social robot, fusing verbal cues, non-verbal cues, and environmental context in a hierarchical LLM-driven framework. It introduces perception-grounded task reasoning on the NICOL robot, combining perceptive reasoning (over non-verbal cues) with task reasoning (over task prompts and interaction history) to generate robot actions. The approach is evaluated on a six-object categorization task with 150 trials, comparing GPT-4, GPT-3.5 variants, Vicuna, and Mistral; GPT-4 achieves the best performance and gives the best explanations of its decisions. The findings suggest that LLMs can reason over multimodal cues for intention prediction in real-time HRI, enabling more natural and proactive social robot interactions, with implications for rapid task adaptation in service environments.

Abstract

Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI

Paper Structure (13 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: An overview of our intention prediction system. An LLM reasons about the user's verbal (saying "I am hungry") and non-verbal (holding the can) cues to generate suitable actions complementary to those of the user, e.g., giving the bowl to the user.
  • Figure 2: An overview of our system hierarchy for intention prediction. Our method for intention prediction consists of perceptive reasoning of the user's non-verbal state and task reasoning which combines explicit user queries (user speech) and task prompts.
  • Figure 3: Perceptive reasoning of the user's non-verbal cues with examples. The user's hand, pose, and face are detected. Then, the corresponding non-verbal cues are recognized as textual tokens, passed to an LLM to generate contextually relevant outputs.
  • Figure 4: A concrete workflow example of the object categorization task. After the user moves an object to each side of the table, the robot assists in categorizing the remaining objects, e.g., the banana is sorted with the lemon since both are yellow fruits.
  • Figure 5: The main sources of system errors in the object categorization task. All LLM models showed good performance in perceptive reasoning. GPT-4 showed the highest performance overall, especially in explaining the decisions made during the interaction.
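
The hierarchy described in Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only shows one plausible way to turn recognized non-verbal cues into textual tokens and combine them with user speech, the environment state, and the interaction history into a single prompt. The names `NonVerbalCues`, `perceptive_summary`, and `build_task_prompt`, the prompt wording, and the pose and face values are all hypothetical. The speech, hand cue, and objects are taken from the Figure 1 example. The call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NonVerbalCues:
    """Textual tokens assumed to come from hand, pose, and face recognizers."""
    hand: str  # e.g. "holding the can"
    pose: str  # e.g. "facing the robot" (hypothetical value)
    face: str  # e.g. "neutral" (hypothetical value)


def perceptive_summary(cues: NonVerbalCues) -> str:
    """Simplified stand-in for perceptive reasoning: describe the cues as text."""
    return f"Hand: {cues.hand}. Pose: {cues.pose}. Face: {cues.face}."


def build_task_prompt(
    task_prompt: str,
    cues: NonVerbalCues,
    speech: str,
    objects: List[str],
    history: List[str],
) -> str:
    """Simplified stand-in for task reasoning: merge all inputs into one prompt."""
    lines = [
        task_prompt,
        "Objects on the table: " + ", ".join(objects),
        "Non-verbal state: " + perceptive_summary(cues),
        f'User said: "{speech}"',
    ]
    # Only mention earlier robot actions when there are any.
    if history:
        lines.append("Previous robot actions: " + "; ".join(history))
    lines.append(
        "Predict the user's intention and propose one complementary robot action."
    )
    return "\n".join(lines)


# Example based on Figure 1: the user holds the can and says "I am hungry".
example_prompt = build_task_prompt(
    task_prompt="You are a robot assisting a user at a table.",
    cues=NonVerbalCues(
        hand="holding the can", pose="facing the robot", face="neutral"
    ),
    speech="I am hungry",
    objects=["can", "bowl"],
    history=[],
)
print(example_prompt)
```

The resulting string would then be sent to whichever LLM is under evaluation. Keeping the cues as plain text means the same prompt can be passed unchanged to each of the compared models.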