A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration
Valerio Belcamino, Nhat Minh Dinh Le, Quan Khanh Luu, Alessandro Carfì, Van Anh Ho, Fulvio Mastrogiovanni
TL;DR
This work tackles real-time human activity recognition (HAR) in human-robot collaboration (HRC) by combining motion data from an IMU-equipped TER glove with tactile information from the vision-based TacLINK sensor. It introduces a three-branch transformer network (ViViT for the tactile video streams and HART for the IMU data) with late fusion to classify 15 hand actions in both segmented offline and continuous online settings, and then demonstrates deployment on a UR5 robot in dynamic HRC tasks. The system reaches an offline accuracy of $94.64\%$ (F1 $=95.60\%$), an online frame accuracy of $83.92\%$ with performance that varies by action, and a median reaction time of $3.54$ s in a dynamic scenario. Reliable recognition of hand-based interactions can improve safety and responsiveness in HRC, with more diverse training data and orientation-aware features as avenues for reducing latency and expanding action coverage. The core contributions are the 15-action recognition task, the multimodal fusion architecture, and the three validation modes (segmented offline, static online, and dynamic HRC).
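To make the late-fusion idea concrete, the following minimal sketch combines the outputs of three classifier branches by averaging their per-class probabilities. This is only one common form of late fusion and is an assumption here: the summary does not specify how the ViViT and HART branch outputs are merged, and the names `softmax`, `late_fusion`, and `NUM_ACTIONS` are hypothetical, as are the placeholder logits.

```python
import math

NUM_ACTIONS = 15  # number of hand actions reported in the summary


def softmax(logits):
    """Convert raw branch scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(branch_logits, weights=None):
    """Fuse per-branch predictions by (weighted) averaging of probabilities.

    ASSUMPTION: probability averaging stands in for the paper's fusion
    step, which may differ (e.g. a learned head over concatenated features).
    Returns the predicted class index and the fused distribution.
    """
    if weights is None:
        weights = [1.0 / len(branch_logits)] * len(branch_logits)
    probs = [softmax(logits) for logits in branch_logits]
    num_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    return label, fused


if __name__ == "__main__":
    # Placeholder logits for three hypothetical branches.
    branches = [[0.0] * NUM_ACTIONS for _ in range(3)]
    branches[0][4] = 5.0  # first branch strongly favors class 4
    branches[1][4] = 3.0  # second branch also favors class 4
    branches[2][7] = 1.0  # third branch weakly favors class 7
    label, fused = late_fusion(branches)
    print("predicted action index:", label)
```

Averaging keeps each branch independently trainable and lets a confident modality outvote an uncertain one. Passing non-uniform `weights` would bias the decision toward a more trusted sensor.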
Abstract
Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond to and dynamically adapt to human intentions. This paper introduces a HAR system that combines a modular data glove equipped with Inertial Measurement Units (IMUs) and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under three conditions: offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show high accuracy across all three tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.
