Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
Anna Deichler, Simon Alexanderson, Jonas Beskow
TL;DR
The paper addresses the lack of spatial awareness in data-driven gesture generation for virtual agents by proposing scene-conditioned gestures. It presents a synthetic dataset that augments existing co-speech gesture data with pointing gestures and multimodal referencing expressions, enabling joint perception of language and scene. Key methods include extending the pointing gesture dataset via mirroring and time-stretching, generating exophoric references with GPT-4, synthesizing speech with a TTS engine, and aligning the modalities with the Hungarian algorithm. This dataset serves as a benchmark for developing embodied agents that gesture appropriately within spatial contexts, with future work addressing richer scene annotations and evaluation protocols.
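The mirroring and time-stretching augmentations mentioned above can be illustrated with a minimal sketch. It assumes a clip is a list of frames, each mapping a joint name to a 3D position, with side-specific joints prefixed `left_` or `right_`. That representation, the joint names, and the functions `mirror_frame` and `time_stretch` are hypothetical and are not taken from the paper, which may operate on joint rotations or use a different resampling scheme.

```python
from typing import Dict, List, Tuple

# Assumed representation (not from the paper): joint name -> (x, y, z) position.
Frame = Dict[str, Tuple[float, float, float]]


def mirror_frame(frame: Frame) -> Frame:
    """Reflect a pose across the plane x = 0 and swap left/right joint names."""
    mirrored: Frame = {}
    for name, (x, y, z) in frame.items():
        if name.startswith("left_"):
            new_name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            new_name = "left_" + name[len("right_"):]
        else:
            new_name = name
        mirrored[new_name] = (-x, y, z)
    return mirrored


def time_stretch(clip: List[Frame], factor: float) -> List[Frame]:
    """Resample a clip to about len(clip) * factor frames by linear interpolation.

    factor > 1 slows the gesture down (more frames), factor < 1 speeds it up.
    """
    if len(clip) < 2 or factor <= 0:
        raise ValueError("need at least two frames and a positive factor")
    n_out = max(2, round(len(clip) * factor))
    stretched: List[Frame] = []
    for i in range(n_out):
        # Fractional position of output frame i on the input timeline.
        t = i * (len(clip) - 1) / (n_out - 1)
        lo = int(t)
        hi = min(lo + 1, len(clip) - 1)
        w = t - lo
        stretched.append({
            name: tuple((1 - w) * a + w * b
                        for a, b in zip(clip[lo][name], clip[hi][name]))
            for name in clip[lo]
        })
    return stretched


def augment(clip: List[Frame], factor: float) -> List[Frame]:
    """Mirror every frame, then time-stretch the mirrored clip."""
    return time_stretch([mirror_frame(f) for f in clip], factor)
```

In this sketch, a right-handed pointing clip becomes a left-handed one at a different tempo. Any scene target the gesture points at would have to be mirrored with the same reflection to stay consistent with the motion.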
Abstract
This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.
