Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task

Bosong Ding; Murat Kirtay; Giacomo Spigler

TL;DR

A generative AI pipeline produces human-like head movements for a Nao humanoid robot. The results show that the robot imitates human head movements in a natural manner while actively tracking the speakers during a conversation.

Abstract

Head movements are crucial for social human-human interaction. They can transmit important cues (e.g., joint attention, speaker detection) that cannot be conveyed through verbal interaction alone. This advantage also holds for human-robot interaction. Even though modeling human motion with generative AI models has become an active research area within robotics in recent years, the use of these methods for producing head movements in human-robot interaction remains underexplored. In this work, we employ a generative AI pipeline to produce human-like head movements for a Nao humanoid robot. In addition, we test the system on a real-time active-speaker tracking task in a group conversation setting. Overall, the results show that the Nao robot successfully imitates human head movements in a natural manner while actively tracking the speakers during the conversation. Code and data from this study are available at https://github.com/dingdingding60/Humanoids2024HRI

Paper Structure (14 sections, 2 equations, 8 figures, 2 tables)

Introduction
Related Work
Methods
Head motion data collection
Head trajectory modelling
Generation of target trajectories
Case study: active-speaker gazing task
Preliminary experiment on human preferences
Results
Trajectory modelling and generation
Case study: active-speaker gazing task
Discussion and Conclusions
Supplementary Materials
Supplementary Results

Figures (8)

Figure 1: Schematic of the proposed method for generating head-gaze movements from human motion data (Methods, "Head motion data collection"), along with its example application to an active-speaker gazing task. The models are obtained by first training a variational autoencoder (VAE) to model the human motion data (Methods, "Head trajectory modelling"). Next, a multilayer perceptron (MLP) is trained to map the end fixation points of the training trajectories to the corresponding latent vectors $\mathbf{z}$ learned by the encoder (Methods, "Generation of target trajectories"). The system is then used by first converting desired fixation points into latent vectors, which the VAE decoder subsequently transforms into motion trajectories. In the active-speaker fixation task (Methods, "Case study: active-speaker gazing task"), audiovisual input from a camera mounted on the head of a Nao robot is fed to the Light-ASD (Liao et al., 2023) active-speaker recognition module, which returns bounding boxes of faces together with a confidence score for each possible speaker. Fixations are then chosen by sampling from a softmax distribution over the confidence scores.
Figure 2: Overview of the setup used to collect head movement trajectories (Methods, "Head motion data collection"). A single participant is asked to produce head-gaze movements between pairs of points organized as three uniform $3\times 3$ grids of points.
Figure 3: Full set of the $174$ human head-motion trajectories collected. Trajectories consist of the yaw and pitch of the head during movement, relative to the initial pose (i.e., $\boldsymbol{\tau}(t) = \left(\mathrm{yaw}(t), \mathrm{pitch}(t) \right)$, $\boldsymbol{\tau}(0) = (0,0)$).
Figure 4: Analysis of the capabilities of the proposed method to synthesize realistic head-gaze movements. (a) We generate and plot $200$ random trajectories by sampling latent vectors from the VAE prior distribution $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$, and compare them to the human motion data from Figure 3. (b) We evaluate the capacity of the system to generate desired head-motion trajectories to look at target points. To do so, we select a set of $21 \times 21$ target fixations on a grid $[-1, 1] \times [-1, 1]$ (normalized yaw/pitch coordinates), and generate trajectories to reach each of them. The final fixation point for each trajectory is shown as a distortion grid. The MSE in original coordinates is approximately $3.7$ degrees. (c) A subset of $5\times 5$ fixation points on the same grid (colored in red in the middle panel) is selected to show the individual trajectories generated to reach each target. Trajectories are divided into segments (alternating in red and blue color) to show the angular velocities in the yaw and pitch directions at each timestep.
Figure 5: Example of head movements on a Nao robot using our proposed method versus the default controller. (a-b) Initial and final pictures during a Nao fixation movement. (c) We track the red marker on Nao's nose (in pixel coordinates) during the fixation movement, using our method versus the default controller. (d-e) We report encoder readings from the Nao's head showing the robot's yaw and pitch angles while (d) moving to a target configuration or (e) executing a trajectory ('target') generated using our proposed method.
...and 3 more figures
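
The last step described in the Figure 1 caption, choosing a fixation by sampling from a softmax distribution over speaker confidence scores, can be illustrated with a short sketch. This is not the authors' implementation: the function names, the `(bounding_box, confidence)` pair layout, the `temperature` parameter, and the pixel-box convention are all assumptions made for illustration, and the real Light-ASD output format may differ.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw confidence scores into a probability distribution."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_fixation(faces, temperature=1.0, rng=None):
    """Sample one face to look at, weighted by the softmax of its score.

    `faces` is a list of (bounding_box, confidence) pairs. This layout is
    a hypothetical one chosen for the sketch, not the Light-ASD format.
    Returns the sampled bounding box and the full probability list.
    """
    rng = rng or random.Random()
    probs = softmax([conf for _, conf in faces], temperature)
    index = rng.choices(range(len(faces)), weights=probs, k=1)[0]
    return faces[index][0], probs


def box_center(box):
    """Return the pixel center of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


if __name__ == "__main__":
    # Two hypothetical detected faces with made-up confidence scores.
    detections = [((40, 60, 120, 160), 0.2), ((300, 50, 380, 150), 1.5)]
    box, probs = choose_fixation(detections, rng=random.Random(0))
    print("fixation target (pixels):", box_center(box))
    print("sampling probabilities:", probs)
```

Sampling, rather than always picking the highest score, lets the robot occasionally glance at less likely speakers. In a full system, the sampled target would still need to be converted from pixel coordinates to a yaw/pitch fixation point before being passed to the MLP and VAE decoder.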
