A Distributed Multi-Modal Sensing Approach for Human Activity Recognition in Real-Time Human-Robot Collaboration

Valerio Belcamino; Nhat Minh Dinh Le; Quan Khanh Luu; Alessandro Carfì; Van Anh Ho; Fulvio Mastrogiovanni

TL;DR

This work tackles real-time human activity recognition (HAR) in human-robot collaboration (HRC) by combining motion data from an IMU-equipped TER glove with tactile information from the vision-based TacLINK sensor. It introduces a three-branch transformer network (one HART branch for the IMU data and two ViViT branches for the tactile video streams) with late fusion to classify 15 hand actions in both segmented offline and continuous online settings, and then demonstrates deployment on a UR5 robot in dynamic HRC tasks. The system reaches an offline accuracy of $94.64\%$ (F1 $=95.60\%$), an online frame accuracy of $83.92\%$ with performance varying across actions, and a median reaction time of $3.54$ s in the dynamic scenario. These results point to the potential of multimodal sensing for safe and responsive collaboration through reliable recognition of hand-based interactions, and the work identifies more diverse training data and orientation-aware features as avenues for reducing latency and expanding action coverage. The core contributions are the 15-action vocabulary, the multimodal fusion, and the three validation modes, which together advance vision-based tactile sensing for real-time HAR in collaborative robotics.
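To make the late-fusion idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each of the three branches has already produced a fixed-size embedding; the embedding size `FEATURE_DIM`, the random stand-in features, and the single untrained linear head are all hypothetical choices made for illustration. Only the three-branch layout and the 15 output classes come from the summary above.

```python
import math
import random

NUM_CLASSES = 15   # number of hand actions reported in the summary
FEATURE_DIM = 8    # hypothetical per-branch embedding size (assumption)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(imu_feat, tactile_feat_a, tactile_feat_b, weights, bias):
    """Concatenate the three branch embeddings and apply one linear layer."""
    fused = imu_feat + tactile_feat_a + tactile_feat_b  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)

# Random stand-ins for the outputs of the IMU branch and the two
# tactile-video branches; a real system would compute these from data.
imu_feat = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
tactile_feat_a = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
tactile_feat_b = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]

# Untrained classifier head, included only to show the shapes involved.
weights = [
    [rng.uniform(-0.1, 0.1) for _ in range(3 * FEATURE_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = late_fusion(imu_feat, tactile_feat_a, tactile_feat_b, weights, bias)
predicted_action = max(range(NUM_CLASSES), key=probs.__getitem__)
```

In this sketch the modalities interact only at the final layer, so each branch could be trained or replaced independently. Whether the actual network fuses by concatenation or by another operation is not stated in this summary.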

Abstract

Human activity recognition (HAR) is fundamental in human-robot collaboration (HRC), enabling robots to respond to and dynamically adapt to human intentions. This paper introduces a HAR system combining a modular data glove equipped with Inertial Measurement Units and a vision-based tactile sensor to capture hand activities in contact with a robot. We tested our activity recognition approach under different conditions, including offline classification of segmented sequences, real-time classification under static conditions, and a realistic HRC scenario. The experimental results show a high accuracy for all the tasks, suggesting that multiple collaborative settings could benefit from this multi-modal approach.

Paper Structure (11 sections, 5 figures, 2 tables)

This paper contains 11 sections, 5 figures, 2 tables.

Introduction
Methodology
Experimental Setup
Offline Validation
Online Validation
Dynamic Validation
Results
Offline Validation
Online Validation
Dynamic Validation
Conclusions

Figures (5)

Figure 1: The user pinches the TacLINK connected to the UR5 robot while wearing the TER glove.
Figure 2: The adopted neural network architecture is composed of three separate branches merged in the last layer. The top branch, based on HART, takes as input the raw data from the IMUs, while the other two, based on ViViT, process the video streams from the TacLINK.
Figure 3: The classification of a recording segment from the continuous dataset. The first row shows the ground truth labels, while the second one depicts the output of the classifier. The last row shows the event-based error metrics described in the Experimental Setup section. For visual clarity, the idle action is not associated with a color and is represented by the white spaces between the blocks.
Figure 4: Snapshots from a Dynamic Validation trial. The user waits while the robot follows its trajectory (top left), receives instructions and approaches the robot (top right), performs the required action on the TacLINK (bottom left), and returns to rest after the model recognises the action (bottom right).
Figure 5: Classification times in the Dynamic Validation of our system. Each box shows the distribution of the classification time for one label. The black line represents the median value, while the black circles represent outliers.
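The frame-level comparison in Figure 3 and the per-label medians in Figure 5 can be illustrated with a short Python sketch. It covers only frame accuracy and per-label median time; it does not reproduce the event-based error metrics, which this summary does not define. The function names, the label sequences, and the timing values below are made up for illustration and are not results from the paper.

```python
from statistics import median


def frame_accuracy(truth, predicted):
    """Fraction of frames whose predicted label equals the ground truth."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not truth:
        raise ValueError("label sequences must not be empty")
    hits = sum(t == p for t, p in zip(truth, predicted))
    return hits / len(truth)


def median_time_per_label(trials):
    """Group (label, seconds) pairs by label and take each group's median."""
    by_label = {}
    for label, seconds in trials:
        by_label.setdefault(label, []).append(seconds)
    return {label: median(times) for label, times in by_label.items()}


# Hypothetical per-frame labels (illustrative only).
truth = ["idle", "idle", "pinch", "pinch", "pinch", "idle"]
predicted = ["idle", "pinch", "pinch", "pinch", "idle", "idle"]
accuracy = frame_accuracy(truth, predicted)  # 4 of 6 frames match

# Hypothetical classification times in seconds (illustrative only).
trials = [
    ("pinch", 3.0), ("pinch", 4.0), ("pinch", 5.0),
    ("push", 2.0), ("push", 4.0),
]
medians = median_time_per_label(trials)  # {"pinch": 4.0, "push": 3.0}
```

Frame accuracy penalizes every mislabeled frame equally, including boundary frames where the prediction starts slightly early or late, which is one reason an event-based view is a useful complement.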
