Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents

Anna Deichler; Simon Alexanderson; Jonas Beskow

TL;DR

The paper addresses the lack of spatial awareness in data-driven gesture generation for virtual agents by proposing scene-conditioned gestures. It presents a synthetic dataset that augments existing co-speech gesture data with pointing gestures and multimodal referencing expressions, enabling joint perception of language and scene. Key methods include extending the pointing gesture dataset via mirroring and time-stretching, generating exophoric references with GPT-4, synthesizing speech with a TTS engine, and aligning the modalities with the Hungarian algorithm. This dataset serves as a benchmark for developing embodied agents that gesture appropriately within spatial contexts, with future work addressing richer scene annotations and evaluation protocols.
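The matching step can be illustrated with a minimal sketch. It assumes that each synthesized speech clip and each pointing clip is summarized by its duration in seconds, and that the cost of pairing two clips is the absolute difference of those durations. This cost function, the names `hungarian` and `match_clips`, and the example durations are illustrative assumptions, not details taken from the paper, whose actual matching criterion may also use the position of the demonstrative within each clip.

```python
def hungarian(cost):
    """Minimum-cost assignment of rows to columns (Hungarian algorithm, O(n^2 m)).

    cost is a list of n rows with m columns each, with n <= m.
    Returns a list where entry i is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("need at least as many columns as rows")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j]: row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so at least one more reduced cost becomes zero.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_clips(speech_durations, gesture_durations):
    """Pair each speech clip with a distinct gesture clip of similar duration.

    Assumed cost: absolute duration difference in seconds (illustrative only).
    """
    cost = [[abs(s - g) for g in gesture_durations] for s in speech_durations]
    return list(enumerate(hungarian(cost)))


# Hypothetical durations in seconds, not values from the dataset.
pairs = match_clips([2.0, 3.5, 1.2], [1.0, 3.4, 2.1, 5.0])
```

With these example durations, each speech clip is paired with the gesture clip closest to it in length, and the 5.0-second gesture clip stays unmatched. Unlike a greedy nearest-neighbour pairing, the assignment minimizes the total mismatch over all pairs at once.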

Abstract

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Paper Structure (11 sections, 4 figures)

Introduction
Related work
Situated gesture generation
Scene conditioning in human motion generation
Dataset generation
Extending the pointing gesture dataset
Text generation
Speech synthesis
Matching the generated speech segments with pointing gestures
Conclusions and future work
GPT-4 prompts

Figures (4)

Figure 1: Distribution of length of pointing motion clips and synthesized audio clips.
Figure 2: Alignment of demonstratives in motion and speech, with pre- and post-padding of audio clips.
Figure 3: Demonstrative location against total length (in seconds) for motion (BVH) and audio files.
Figure 4: Visualization of dataset examples containing synchronized clips of audio and gesture for (a) pointing and (b) beat gestures.
