Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents

Anna Deichler, Simon Alexanderson, Jonas Beskow

TL;DR

The paper addresses the lack of spatial awareness in data-driven gesture generation for virtual agents by proposing scene-conditioned gestures. It presents a synthetic dataset that augments existing co-speech gesture data with pointing gestures and multimodal referencing expressions, enabling joint perception of language and scene. Key methods include extending the pointing gesture dataset via mirroring and time-stretching, generating exophoric references with GPT-4, synthesizing speech with a TTS engine, and aligning the modalities with the Hungarian algorithm. This dataset serves as a benchmark for developing embodied agents that gesture appropriately within spatial contexts, with future work addressing richer scene annotations and evaluation protocols.
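
The alignment step can be illustrated with a small assignment problem. The sketch below is not the authors' implementation: it assumes each motion clip and each synthesized audio clip is summarized by a hypothetical `(total_length, demonstrative_time)` pair in seconds, and that the matching cost is the sum of the absolute differences of those two values. The function names `hungarian` and `pairing_cost` are illustrative, and the clip values are made up for the example.

```python
from math import inf


def pairing_cost(motion, audio):
    """Assumed cost: mismatch in total length plus mismatch in demonstrative time."""
    return abs(motion[0] - audio[0]) + abs(motion[1] - audio[1])


def hungarian(cost):
    """Minimum-cost assignment of rows to columns (requires rows <= columns).

    Returns a list where entry i is the column index assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j]: row currently matched to column j (1-based, 0 = none)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


# Made-up clips: (total length in s, time of the demonstrative in s).
motion_clips = [(4.0, 1.5), (6.0, 3.0), (5.0, 2.0)]
audio_clips = [(5.2, 2.1), (3.9, 1.4), (6.3, 2.8)]

cost_matrix = [[pairing_cost(mc, ac) for ac in audio_clips] for mc in motion_clips]
match = hungarian(cost_matrix)
for motion_idx, audio_idx in enumerate(match):
    print(f"motion clip {motion_idx} <-> audio clip {audio_idx}")
```

The Hungarian algorithm returns a one-to-one pairing that minimizes the total cost over all clips, which differs from greedily pairing each motion clip with its nearest audio clip. The cost function that the dataset actually uses may differ from the one assumed here.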

Abstract

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Paper Structure (11 sections, 4 figures)

Figures (4)

  • Figure 1: Distribution of length of pointing motion clips and synthesized audio clips.
  • Figure 2: Alignment of demonstratives in motion and speech, with pre- and post-padding of audio clips.
  • Figure 3: Demonstrative location against total length (in seconds) for motion (BVH) and audio files.
  • Figure 4: Visualization of dataset examples containing synchronized clips of audio and gesture for (a) pointing and (b) beat gestures.
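
The pre- and post-padding mentioned in the Figure 2 caption can be sketched as simple arithmetic. This is an illustrative guess at the idea, not the authors' procedure: it assumes the audio clip is padded with silence so that its demonstrative lands at the same time as the demonstrative in the motion clip and both clips end together. The function name `padding_for_alignment` and the example durations are hypothetical.

```python
def padding_for_alignment(motion_len, motion_dem, audio_len, audio_dem):
    """Return (pre_pad, post_pad) in seconds of silence to add around the audio.

    motion_len / audio_len: total clip durations in seconds.
    motion_dem / audio_dem: time of the demonstrative within each clip.
    Assumes the audio must fit inside the motion clip once the demonstratives
    coincide; otherwise a ValueError is raised.
    """
    pre_pad = motion_dem - audio_dem
    post_pad = (motion_len - motion_dem) - (audio_len - audio_dem)
    if pre_pad < 0 or post_pad < 0:
        raise ValueError("audio does not fit inside the motion clip around the demonstrative")
    return pre_pad, post_pad


# Hypothetical example: a 6 s motion clip with the pointing stroke at 3 s,
# and a 4 s audio clip whose demonstrative is spoken at 1.5 s.
pre, post = padding_for_alignment(6.0, 3.0, 4.0, 1.5)
print(pre, post)  # 1.5 0.5
```

With these values the padded audio lasts 1.5 + 4.0 + 0.5 = 6.0 seconds, matching the motion clip, and the demonstrative falls at 3.0 seconds in both modalities.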