Gaze-Based Intention Recognition for Human-Robot Collaboration
Valerio Belcamino, Miwa Takase, Mariya Kilina, Alessandro Carfì, Akira Shimada, Sota Shimizu, Fulvio Mastrogiovanni
TL;DR
The paper addresses online human intent recognition in human-robot collaboration during assembly tasks. It compares gaze-based intention estimation, implemented with a headset-based eye tracker and Unreal Engine, against an IMU-based LSTM classifier, with both embedded in a Hierarchical Task Network planning framework. Results show that gaze-based perception performs comparably to IMUs in effectiveness and user satisfaction, with trade-offs in hardware and processing; idle and total assembly times differ between the two methods but reflect similar coordination capabilities. The work highlights the potential of gaze as a viable option that requires fewer sensors, and suggests fusing the two modalities in future work to handle more complex human actions and to anticipate them.
Abstract
This work tackles the intent recognition problem in Human-Robot Collaborative assembly scenarios. Specifically, we consider the interactive assembly of a wooden stool, in which the robot fetches the pieces in the correct order and the human builds the parts following the instruction manual. Intent recognition is limited to idle state estimation and is needed to ensure better synchronization between the two agents. We compared two distinct solutions, one based on wearable sensors and one on eye tracking, each integrated into the perception pipeline of a flexible planning architecture based on Hierarchical Task Networks. At runtime, the wearable sensing module feeds the raw measurements from four 9-axis Inertial Measurement Units, positioned on the wrists and hands of the user, to a Long Short-Term Memory network. The eye-tracking module, by contrast, relies on a Head-Mounted Display and Unreal Engine. We tested the effectiveness of the two approaches with 10 participants, each of whom tried both options in alternating order. We collected explicit metrics about the attractiveness and efficiency of the two techniques through User Experience Questionnaires, as well as implicit metrics regarding the classification time and the overall assembly time. The results show that the two methods reach comparable performance in terms of both effectiveness and user preference. Future development could aim at combining the two approaches to allow the recognition of more complex activities and to anticipate the user's actions.
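As a rough illustration of the wearable sensing input described above, the following sketch shows how the readings of four 9-axis IMUs could be arranged into 36-dimensional time steps and grouped into windows for a sequence classifier such as an LSTM. It is not the authors' implementation: the sensor names, the interpretation of the nine axes as accelerometer, gyroscope, and magnetometer channels, and the window length and stride are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

NUM_IMUS = 4       # from the abstract: four IMUs on the wrists and hands
AXES_PER_IMU = 9   # 9-axis IMU; assumed here to be 3 accel + 3 gyro + 3 mag
FEATURES_PER_STEP = NUM_IMUS * AXES_PER_IMU  # 36 raw values per time step

# Hypothetical sensor names; the abstract only states wrists and hands.
IMU_ORDER = ("left_wrist", "left_hand", "right_wrist", "right_hand")


def flatten_step(readings: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate one raw sample per IMU into a single feature vector."""
    features: List[float] = []
    for name in IMU_ORDER:
        sample = readings[name]
        if len(sample) != AXES_PER_IMU:
            raise ValueError(f"{name}: expected {AXES_PER_IMU} values, got {len(sample)}")
        features.extend(sample)
    return features


def sliding_windows(steps: List[List[float]], window: int, stride: int) -> List[List[List[float]]]:
    """Split a stream of time steps into fixed-length, possibly overlapping windows."""
    return [steps[i:i + window] for i in range(0, len(steps) - window + 1, stride)]


# Ten synthetic time steps; window and stride values are assumptions.
steps = [flatten_step({name: [float(t)] * AXES_PER_IMU for name in IMU_ORDER}) for t in range(10)]
windows = sliding_windows(steps, window=4, stride=2)
```

Each window has shape (window, 36) and would be the unit passed to the classifier, which in the setting described here would label it as idle or not idle.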
