HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs

Esteve Valls Mascaro; Daniel Sliwowski; Dongheui Lee

TL;DR

HOI anticipation addresses the need for proactive robot assistance in human–robot collaboration. The authors present HOI4ABOT, a transformer-based framework that uses Patch Merger, dual cross-attention Transformers, and Hydra multi-heads to detect and anticipate HOIs in video. They integrate Dynamic Movement Primitives for motion generation and Behavior Trees for planning, and validate on VidHOI with improvements of $1.76\%$ and $1.04\%$ in mAP for detection and anticipation, plus a $15.4\times$ speedup. Real-world experiments with a Franka Emika Panda demonstrate proactive pouring, reducing human waiting time and achieving $85\%$ success across 20 trials. These results highlight the practical potential of intention-reading for improving human–robot collaboration and point to domain-specific data and control-method enhancements for future work.

Abstract

Robots are becoming increasingly integrated into our lives, assisting us in various tasks. To ensure effective collaboration between humans and robots, it is essential that they understand our intentions and anticipate our actions. In this paper, we propose a Human-Object Interaction (HOI) anticipation framework for collaborative robots. We propose an efficient and robust transformer-based model to detect and anticipate HOIs from videos. This enhanced anticipation empowers robots to proactively assist humans, resulting in more efficient and intuitive collaborations. Our model outperforms state-of-the-art results in HOI detection and anticipation on the VidHOI dataset, with an increase of 1.76% and 1.04% in mAP respectively, while being 15.4 times faster. We showcase the effectiveness of our approach through experimental results on a real robot, demonstrating that the robot's ability to anticipate HOIs is key for better Human-Robot Interaction. More information can be found on our project webpage: https://evm7.github.io/HOI4ABOT_page/

Paper Structure (20 sections, 8 figures, 6 tables)

Introduction
Related Works
Human Intention in Robotics
HOI Detection and Anticipation
Task and Motion Planning
Methodology
Human-Object Interaction
Motion generation and task planning
Experiments
Dataset and Metrics
Quantitative evaluation
Ablation study
Real World Experiments
Limitations
Conclusions
...and 5 more sections

Figures (8)

Figure 1: Overview of our HOI4ABOT framework. A robot leverages RGB data to detect and anticipate the human-object interactions in its surroundings and assist the human in a timely manner. The robot anticipates the human intention of holding the cup, so it prepares itself for pouring by grabbing the bottle. The robot reacts to the human holding the cup by pouring water.
Figure 2: HOI4ABOT architecture overview. We consider a video of $T+1$ frames with the pre-extracted object and human bounding boxes $\mathbf{B}^t$. Our module first extracts relevant features per frame (left) and then detects and anticipates HOIs (right). First, a ViT backbone [oquab2023dinov2] extracts patch-based local $\mathbf{E}^t$ and global $\mathbf{cls}_t$ features for each frame $t$. Then, we obtain features per human $\mathbf{e}_n^t$ and object $\mathbf{e}_m^t$ by aligning $\mathbf{E}^t$ to their bounding boxes, as shown in light blue. We also project each $\mathbf{B}^t$ to $\hat{\mathbf{B}}^t$ using a box embedder [posembed_fourier], and the object category to $\mathrm{s_m}$ using CLIP [clip]. Our Dual Transformer, shown in purple, leverages the human and object-constructed windows (sequences in red and blue respectively) through two cross-attention transformers, where $\mathrm{K}$ey, $\mathrm{Q}$uery, and $\mathrm{V}$alue are used in the attention mechanism. $\mathrm{q}$ is a learnable parameter to learn the evolution of the location in time. Finally, we project the enhanced last feature from the Human Blender to detect and anticipate HOIs at several time horizons $i_k^{\tau}$ in the future through our Hydra head (shown in light green).
Figure 3: Mean objective fluency metrics for the pouring experiments at different confidence thresholds $\{0.3, 0.5, 0.7\}$ in HOI prediction.
Figure 4: Real-world experiments scenario.
Figure 5: Human waiting time to be served the drink for different confidence thresholds $\{0.3, 0.5, 0.7\}$ and anticipation heads $\tau_a=\{0,1,3,5\}$.
...and 3 more figures
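
The cross-attention step named in the Figure 2 caption can be illustrated with a minimal sketch of scaled dot-product attention between two feature windows. This is not the authors' implementation: it assumes a single head, omits the learned Key/Query/Value projections, and uses toy 2-D vectors. The names `cross_attention`, `human_window`, and `object_window` are hypothetical, and the choice of the human window as the query stream is an assumption made only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    Illustrative only: no learned projections, single head.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted sum of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, all_weights

# Toy windows of three frames each (hypothetical 2-D features).
human_window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_window = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Assumption: human features act as queries, object features as keys and values.
blended, weights = cross_attention(human_window, object_window, object_window)
```

Each row of `weights` sums to 1, so every output row in `blended` is a convex combination of the object features, weighted by how similar they are to the corresponding human feature.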
