Enhancing Context-Aware Human Motion Prediction for Efficient Robot Handovers

Gerard Gómez-Izquierdo, Javier Laplaza, Alberto Sanfeliu, Anaís Garrell

TL;DR

This work targets real-time human motion prediction for handovers in human-robot collaboration (HRC). It introduces IntentMotion, a lightweight siMLPe-based framework that adds intention conditioning, an intention classifier, and task-specific losses to improve accuracy while preserving efficiency. On a handover dataset, the method achieves about 200x faster inference with roughly 3% of the parameters and reduces body L2 error from $0.355\,\mathrm{m}$ to $0.165\,\mathrm{m}$, while also improving the predicted right-hand dynamics. The approach demonstrates practical real-time applicability for safe, natural HRC handovers and offers a path toward extending intention-aware prediction to broader HRC tasks.

Abstract

Accurate human motion prediction (HMP) is critical for seamless human-robot collaboration, particularly in handover tasks that require real-time adaptability. Despite the high accuracy of state-of-the-art models, their computational complexity limits practical deployment in real-world robotic applications. In this work, we enhance human motion forecasting for handover tasks by leveraging siMLPe [1], a lightweight yet powerful architecture, and introducing key improvements. Our approach, named IntentMotion, incorporates intention-aware conditioning, task-specific loss functions, and a novel intention classifier, significantly improving motion prediction accuracy while maintaining efficiency. Experimental results demonstrate that our method reduces body loss error by over 50%, achieves 200x faster inference, and requires only 3% of the parameters compared to existing state-of-the-art HMP models. These advancements establish our framework as a highly efficient and scalable solution for real-time human-robot interaction.
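
As a quick consistency check, the "over 50%" figure can be related to the body L2 errors quoted in the TL;DR ($0.355$ m before, $0.165$ m after), assuming both statements refer to the same metric:

$$
\frac{0.355 - 0.165}{0.355} = \frac{0.190}{0.355} \approx 0.535,
$$

that is, a relative reduction of roughly 53.5%.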

Paper Structure

This paper contains 19 sections, 17 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Motion prediction based on detected intention. Both sequences share the same ground truth represented by a dark blue skeleton and a green right hand. Top: Prediction with collaborative intention. Middle: Prediction with non-collaborative intention. Red Circle: Represents the last frame of the robot's end effector. Bottom: Real handover sequence.
  • Figure 2: Overview of our siMLPe-based approach for human motion prediction. FC denotes a fully connected layer, LN represents layer normalization, Trans indicates a transpose operation, EMB denotes the embedding layer, and AVG represents the average pooling layer. DCT and IDCT correspond to the discrete cosine and inverse discrete cosine transformations, respectively. The MLP module (highlighted in pink) consists of 48 FC and LN layers repeated across the architecture. The green box represents how the intention classifier processes data before intention prediction.
  • Figure 3: The figure presents three top-view scenarios from the dataset. The human (right figure) and the robot (left figure) interact in different environments, with obstacles depicted as red squares. Dashed lines indicate human movement, while the robot follows corresponding target points.
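
The building blocks named in the Figure 2 caption (DCT, Trans, FC, LN, IDCT) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the block layout, the residual connection, the random weights, the two-block depth, and the names `dct_matrix`, `layer_norm`, and `toy_predict` are all assumptions made for illustration, and the intention embedding (EMB), average pooling (AVG), and classifier are left out.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (each row is one basis vector)."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)        # rescale the constant row so that m @ m.T == I
    return m

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def toy_predict(motion, weights, biases):
    """motion: (T, C) array holding T frames of C pose coordinates.

    Hypothetical layout (an assumption, not taken from the paper):
    DCT -> [Trans -> FC over time -> LN -> Trans -> residual] per block -> IDCT.
    """
    T = motion.shape[0]
    D = dct_matrix(T)
    x = D @ motion                   # DCT along the time axis
    for w, b in zip(weights, biases):
        h = x.T @ w + b              # Trans, then FC mixing time steps: (C, T)
        x = x + layer_norm(h).T      # LN, Trans back, residual connection
    return D.T @ x                   # IDCT: transpose of the orthonormal DCT

rng = np.random.default_rng(0)
T, C, n_blocks = 10, 6, 2            # toy sizes, far smaller than a real model
motion = rng.normal(size=(T, C))
weights = [rng.normal(scale=0.1, size=(T, T)) for _ in range(n_blocks)]
biases = [np.zeros(T) for _ in range(n_blocks)]
pred = toy_predict(motion, weights, biases)
print(pred.shape)  # (10, 6)
```

Because the DCT matrix is orthonormal, its transpose serves as the IDCT, so with no blocks at all the function returns the input motion unchanged.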