
Active Listener: Continuous Generation of Listener's Head Motion Response in Dyadic Interactions

Bishal Ghosh, Emma Li, Tanaya Guha

TL;DR

This work introduces the task of generating a listener's continuous head motion in response to the speaker's speech in real time, and proposes a graph-based end-to-end crossmodal model that takes the interlocutor's speech audio as input and directly generates the listener's head pose angles.

Abstract

A key component of dyadic spoken interactions is contextually relevant non-verbal gestures, such as head movements that reflect a listener's response to the interlocutor's speech. Although significant progress has been made in generating co-speech gestures, generating a listener's response has remained a challenge. We introduce the task of generating a listener's continuous head motion in response to the speaker's speech in real time. To this end, we propose a graph-based end-to-end crossmodal model that takes the interlocutor's speech audio as input and directly generates the head pose angles (roll, pitch, yaw) of the listener in real time. Unlike previous work, our approach is completely data-driven, does not require manual annotations, and does not oversimplify head motion to mere nods and shakes. Extensive evaluation on the dyadic interaction sessions of the IEMOCAP dataset shows that our model produces a low overall error (4.5 degrees) and a high frame rate, indicating its deployability in real-world human-robot interaction systems. Our code is available at https://github.com/bigzen/Active-Listener

Paper Structure

This paper contains 13 sections, 2 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: We introduce the task of generating continuous head motion response of a listener solely based on the speaker's speech in a dyadic interaction. Different from past work, we present a completely data-driven approach that generates a 3D head pose sequence in real time.
  • Figure 2: Overview of our model to generate the listener's head motion response from the speaker's speech. Speech is represented as a cycle graph, which a GNN-based encoder-decoder architecture processes to generate head motion as a time series of head pose angles. The graph architecture produces a compact yet accurate model, facilitating real-time generation.
  • Figure 3: Sample results of generated head motion response (in terms of roll, pitch and yaw) using the proposed model with wav2vec2. Overall, the generated results closely approximate the ground truth.
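
The cycle-graph representation mentioned in the Figure 2 caption can be illustrated with a small sketch. This summary does not say how the authors build the graph or which GNN layers they use, so everything below is an assumption made for illustration: each node is taken to be one speech frame's feature vector, consecutive frames are linked, the last frame wraps around to the first, and a single round of neighbour averaging stands in for a GNN layer. The function names `cycle_graph_adjacency` and `mean_aggregate` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def cycle_graph_adjacency(num_nodes: int) -> np.ndarray:
    """Symmetric adjacency matrix of a cycle graph.

    Node i is linked to nodes (i - 1) and (i + 1), modulo num_nodes.
    Assumption: one node per speech frame; the paper may differ.
    """
    if num_nodes < 3:
        raise ValueError("a cycle graph needs at least 3 nodes")
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    idx = np.arange(num_nodes)
    nxt = (idx + 1) % num_nodes
    adj[idx, nxt] = 1.0  # edge from frame i to frame i + 1
    adj[nxt, idx] = 1.0  # same edge in the other direction
    return adj


def mean_aggregate(features: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """One round of neighbour averaging with self-loops.

    This is a generic message-passing step used here as a stand-in
    for a GNN layer; it is not the authors' encoder.
    """
    adj_self = adj + np.eye(adj.shape[0], dtype=adj.dtype)
    degree = adj_self.sum(axis=1, keepdims=True)
    return (adj_self @ features) / degree


# Hypothetical input: 8 frames of 16-dimensional speech features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16)).astype(np.float32)

adj = cycle_graph_adjacency(frames.shape[0])
smoothed = mean_aggregate(frames, adj)
```

In this sketch every node has exactly two neighbours, so the graph has as many edges as nodes. That sparsity is one plausible reason a cycle graph keeps the model compact, although the caption itself does not spell out the reasoning.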