Robot Interaction Behavior Generation based on Social Motion Forecasting for Human-Robot Interaction

Esteve Valls Mascaro; Yashuai Yan; Dongheui Lee

TL;DR

The paper tackles the challenge of integrating robots into social environments by learning a latent space shared by humans and robots and forecasting social motion within it. It introduces ECHO, a transformer-based two-stage framework that first predicts individual motions and then refines them with social context, conditioned on textual interaction intents. The approach achieves state-of-the-art results on the large dyadic social motion dataset InterGen and on the human-robot collaboration dataset CHICO, while supporting real-time inference and retargeting across robots. The unified latent representation and iterative, attention-driven refinement make the resulting human-robot interactions more natural and controllable.

Abstract

Integrating robots into populated environments is a complex challenge that requires an understanding of human social dynamics. In this work, we propose to model social motion forecasting in a shared human-robot representation space, which allows us to synthesize robot motions that interact with humans in social scenarios despite not observing any robot during motion training. We develop a transformer-based architecture called ECHO, which operates in this shared space to predict the future motions of the agents encountered in social scenarios. Contrary to prior works, we reformulate the social motion problem as the refinement of the predicted individual motions based on the surrounding agents, which simplifies training while still allowing single-motion forecasting when only one human is in the scene. We evaluate our model on multi-person and human-robot motion forecasting tasks and obtain state-of-the-art performance by a large margin while remaining efficient and running in real time. Additionally, our qualitative results showcase the effectiveness of our approach in generating human-robot interaction behaviors that can be controlled via text commands. Webpage: https://evm7.github.io/ECHO/
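The refinement formulation in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the embedding shape `(T, D)`, the parameter-free single-head attention, the residual update, and the choice to update all agents simultaneously are assumptions made only for illustration. The point it shows is that a lone agent keeps its individual forecast, while multiple agents are refined against each other for a fixed number of iterations.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption; a trained model would use learned weights).
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context

def refine_socially(individual, num_iters=2):
    """Refine per-agent embeddings, each of shape (T, D), using the other agents."""
    if len(individual) == 1:
        # Only one agent in the scene: keep the individual forecast.
        return list(individual)
    current = list(individual)
    for _ in range(num_iters):
        updated = []
        for i, emb in enumerate(current):
            others = np.concatenate(
                [e for j, e in enumerate(current) if j != i], axis=0
            )
            # Residual update: individual embedding plus social context.
            updated.append(emb + cross_attention(emb, others))
        current = updated
    return current

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 8))  # hypothetical embedding of agent 0
b = rng.normal(size=(6, 8))  # hypothetical embedding of agent 1
solo = refine_socially([a])
pair = refine_socially([a, b], num_iters=3)
```

Keeping the individual forecast as the starting point is what lets the same pipeline serve both the single-agent and the social case in this sketch.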

Paper Structure (36 sections, 6 equations, 4 figures, 3 tables)

INTRODUCTION
RELATED WORK
Human Motion Forecasting
Motion retargeting in robotics
Human-Robot Interaction (HRI)
METHODOLOGY
Problem Formulation
Social Motion Forecasting
Motion Forecasting as Refinement
Pose Encoder
Single-Motion Encoder
Multiple motion forecasting
Pose Decoder
Losses
Single ($\mathcal{L}_{ind}$) and Social ($\mathcal{L}_{soc}$) Skeleton Losses
...and 21 more sections

Figures (4)

Figure 1: Overview of our ECHO framework. First, we learn how to encode ($E$) and decode ($D$) the JVRC-1 robot (top left, ${R1}$) and the TIAGo++ robot (bottom left, ${R2}$) to a latent representation shared with a human (${H}$) while preserving the semantics of each pose. Then, we take advantage of this shared space in the social motion forecasting task. Our Single Encoder learns the dynamics of single agents given a textual intention and their past observations. Next, we iteratively refine those motions based on the social context of the surrounding agents using the Social Decoder. Our overall framework can decode the robot's motion in a social environment, closing the gap for natural and accurate Human-Robot Interaction.
Figure 2: Overview of our ECHO architecture. Our model first focuses on synthesizing individual human motions. We pad the observed motion $[\mathbf{p}^i_{t},\cdots,\mathbf{p}^i_{N}]$ of the $i$-th human by repeating the current pose $\mathbf{p}^i_{N}$ and obtain $\mathbf{X}^i_{ind}$. Because our model is conditioned on the social interaction type $a$ and on $\mathbf{X}^i_{ind}$, we encode both and concatenate them to build $\mathbf{\bar{E}}_{ind}^i$. Then, we forecast the individual motions through a self-attention transformer followed by a Temporal MLP with $k$ layers, which yields a single-motion representation $\mathbf{\hat{E}}_{ind}^i$. Since we are considering a social scenario, we iteratively refine the motion of human 0 given human 1 using cross-attention, and vice versa, obtaining $\mathbf{\hat{E}}_{soc}^0$ and $\mathbf{\hat{E}}_{soc}^1$. This refinement is repeated $K$ times. Finally, we decode each $\mathbf{\hat{E}}_{soc}^i$ and add the last observed pose $\mathbf{p}^i_{N}$ to make the model invariant to global translations.
Figure 3: Social motion forecasting for Human-Robot Interaction. The human-human pair represents the ground truth, while the human-robot pair represents the forecasted human-robot interaction.
Figure 4: Qualitative results for social motion forecasting on the InterGen dataset. Each scenario shows the ground-truth human pair (left) and the predicted pair (right) for each time horizon.
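
The padding and residual steps described in the Figure 2 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function names `pad_observed_motion` and `add_last_pose`, the `(frames, joints, 3)` array layout, the frame and joint counts, and the zero-valued stand-in for the decoder output are all hypothetical.

```python
import numpy as np

def pad_observed_motion(observed, horizon):
    """Append `horizon` copies of the last observed pose.

    observed: array of shape (N, J, 3), i.e. N poses with J joints (assumed layout).
    Returns an array of shape (N + horizon, J, 3).
    """
    last_pose = observed[-1:]                        # shape (1, J, 3)
    padding = np.repeat(last_pose, horizon, axis=0)  # shape (horizon, J, 3)
    return np.concatenate([observed, padding], axis=0)

def add_last_pose(decoded_offsets, last_pose):
    """Add the last observed pose to the decoded offsets (residual prediction)."""
    return decoded_offsets + last_pose[None]

rng = np.random.default_rng(0)
observed = rng.normal(size=(10, 22, 3))  # hypothetical: 10 frames, 22 joints
x_ind = pad_observed_motion(observed, horizon=25)

# Stand-in for the decoder output: zero offsets mean "stay at the last pose".
offsets = np.zeros((25, 22, 3))
future = add_last_pose(offsets, observed[-1])

# Shifting the whole observation shifts the prediction by the same amount
# when the decoded offsets are unchanged.
shift = np.array([1.0, 0.0, 0.0])
future_shifted = add_last_pose(offsets, (observed + shift)[-1])
```

Predicting offsets relative to the last observed pose means the decoder does not need to model where in the room the interaction takes place, which is the translation-invariance property the caption refers to.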
