To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

Carlo Mazzola, Marta Romeo, Francesco Rea, Alessandra Sciutti, Angelo Cangelosi

TL;DR

This work tackles the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. It does so with a hybrid deep learning model composed of convolutional layers and LSTM cells.

Abstract

Communicating shapes our social world. For a robot to be considered social and consequently be integrated into our social environment, it is fundamental that it understands some of the dynamics that rule human-human communication. In this work, we tackle the problem of Addressee Estimation, the ability to understand an utterance's addressee, by interpreting and exploiting non-verbal bodily cues from the speaker. We do so by implementing a hybrid deep learning model composed of convolutional layers and LSTM cells, taking as input images portraying the face of the speaker and 2D vectors of the speaker's body posture. Our implementation choices were guided by the aim of developing a model that could be deployed on social robots and be efficient in ecological scenarios. We demonstrate that our model is able to solve the Addressee Estimation problem in terms of addressee localisation in space, from a robot ego-centric point of view.

Paper Structure

This paper contains 20 sections, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Illustrative frames from Vernissage Dataset. Examples of multiparty HRI data recorded from the Nao robot's cameras. Some pictures show blurred faces for privacy reasons.
  • Figure 2: Illustration of a sequence. Aggregation of frames in a sequence of 0.8 s and extraction of body poses and face images.
  • Figure 3: Illustration of an utterance. The utterance is partitioned into sequences of 0.8 s. Utterances were defined as speech intervals addressed to the same addressee and delimited by silence. Each utterance comprised at least one sequence.
  • Figure 4: Illustration of the Deep Neural Network for Addressee Estimation employing an intermediate fusion approach (Exp. 1a). Face images and body pose vectors are passed separately to two convolutional blocks, each comprising two 2D convolutional layers and one max-pooling layer. Then, the two embeddings resulting from fully connected layers are concatenated, and sequences of 10 fused embeddings are passed to the LSTM layer. The output is provided after two other fully connected layers and a LogSoftMax layer. * represents the LeakyReLU activation function.
  • Figure 5: Bar plots reporting performance of Addressee Estimation model in the four 3-class experiments. Results of the 10-fold cross-validation experiments (Exp. 1.a-b-c-d) are provided in terms of mean and standard deviation (error bar) of weighted F1-scores. On the y-axis the performance score is expressed in %.
  • ...and 3 more figures
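
The captions for Figures 2 to 4 describe utterances being partitioned into 0.8 s sequences, with sequences of 10 fused embeddings passed to the LSTM. The following is a minimal sketch of that partitioning step, not the authors' implementation. It assumes one embedding per frame (so 10 frames per sequence), non-overlapping sequences, and that trailing frames which do not fill a whole sequence are dropped; none of these details are stated in the captions. The function name `split_into_sequences` and the example utterance are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

SEQUENCE_SECONDS = 0.8    # sequence duration, from the Figure 2 and 3 captions
FRAMES_PER_SEQUENCE = 10  # sequence length, from the Figure 4 caption


def split_into_sequences(
    frames: Sequence[T], size: int = FRAMES_PER_SEQUENCE
) -> List[List[T]]:
    """Chunk one utterance's frames into consecutive fixed-size sequences.

    Assumptions (not from the source): sequences do not overlap, and
    trailing frames that cannot fill a whole sequence are discarded.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    n_full = len(frames) // size  # number of complete sequences
    return [list(frames[i * size:(i + 1) * size]) for i in range(n_full)]


# Hypothetical utterance: 25 frame indices standing in for
# (face image, body pose vector) pairs.
utterance = list(range(25))
sequences = split_into_sequences(utterance)

for index, sequence in enumerate(sequences):
    print(f"sequence {index}: frames {sequence[0]}..{sequence[-1]}")
```

Under these assumptions the 25-frame utterance yields two full sequences and the last five frames are left out. A real pipeline might instead pad or overlap sequences; the captions do not say which.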