Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task
Hassan Ali, Philipp Allgeuer, Stefan Wermter
TL;DR
The paper addresses human intention prediction in a collaborative object categorization task with a social robot. It fuses verbal cues, non-verbal cues, and environmental context in a hierarchical LLM-driven framework. Implemented on the NICOL robot, the framework performs perception-grounded task reasoning: perceptive reasoning interprets the non-verbal cues, and task reasoning combines prompts and interaction history to generate robot actions. The approach is evaluated on a six-object categorization task with 150 trials, comparing GPT-4, GPT-3.5 variants, Vicuna, and Mistral; GPT-4 achieves the best performance and explainability. The findings indicate that LLMs can robustly reason over multimodal cues for intention prediction in real-time HRI, enabling more natural and proactive social robot interactions, with implications for rapid task adaptation in service environments.
Abstract
Human intention-based systems enable robots to perceive and interpret user actions so that they can interact with humans and adapt to their behavior proactively. Intention prediction is therefore pivotal in creating natural interaction with social robots in human-designed environments. In this paper, we examine the use of Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, such as hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows their potential for reasoning about verbal and non-verbal user cues, leveraging their context understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI
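
To make the hierarchical fusion described above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-stage flow in which non-verbal cues are first summarized as text and then combined with the verbal cue, the environment state, and the interaction history into a single prompt for an LLM. All class names, field names, and the prompt wording are hypothetical, and the LLM call itself is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NonVerbalCues:
    # Hypothetical fields; the abstract names gestures, body poses,
    # and facial expressions as the non-verbal cues.
    hand_gesture: str
    body_pose: str
    facial_expression: str


@dataclass
class Observation:
    cues: NonVerbalCues
    utterance: str              # user verbal cue, e.g. transcribed speech
    environment: List[str]      # assumed environment state: objects in view
    history: List[str] = field(default_factory=list)


def perceptive_summary(cues: NonVerbalCues) -> str:
    """Stage 1 (assumed): describe the non-verbal cues as short text."""
    return (
        f"gesture: {cues.hand_gesture}; "
        f"pose: {cues.body_pose}; "
        f"expression: {cues.facial_expression}"
    )


def build_task_prompt(obs: Observation) -> str:
    """Stage 2 (assumed): merge the summary with verbal and environment context."""
    lines = [
        "Predict the user's intention in the object categorization task.",
        f"Non-verbal cues: {perceptive_summary(obs.cues)}",
        f"User said: {obs.utterance!r}",
        f"Objects on table: {', '.join(obs.environment)}",
    ]
    # Interaction history is only included once there is something to report.
    if obs.history:
        lines.append("History: " + " | ".join(obs.history))
    return "\n".join(lines)


obs = Observation(
    cues=NonVerbalCues("pointing at apple", "leaning forward", "neutral"),
    utterance="This one goes with the fruits.",
    environment=["apple", "orange", "cup"],
)
prompt = build_task_prompt(obs)
print(prompt)
```

In this sketch the resulting prompt string would be passed to whichever LLM is under evaluation; keeping the perception summary separate from the task prompt mirrors the split between perceptive reasoning and task reasoning only at a schematic level.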
