Large Language Models for Virtual Human Gesture Selection

Parisa Ghanad Torshizi, Laura B. Hensel, Ari Shapiro, Stacy C. Marsella

TL;DR

The paper investigates automating co-speech gesture selection for embodied virtual agents by using GPT-4 to (i) select semantically appropriate gestures and (ii) determine when to gesture via rheme/theme discourse analysis. It defines a gestural-intent taxonomy tied to image schemas, evaluates multiple prompting strategies on a labor-activist speech dataset, and integrates the approach into the SIMA framework with an end-to-end pipeline from utterance to BML to SmartBody animation. Key findings show that prompting with gestural intents and examples improves gesture appropriateness and alignment with speaker behavior, while rheme/theme analysis effectively guides timing. Real-time performance remains challenging, however, which motivates future work on smaller models and latency reduction. The work demonstrates a viable path toward controllable, semantically meaningful nonverbal behavior in virtual agents, with practical impact on human-agent communication and engagement.

Abstract

Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have varied from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require crafting specific gesture expertise and are time-consuming and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.

Paper Structure

This paper contains 23 sections and 3 figures.

Figures (3)

  • Figure 1: Comparison of different prompting approaches in terms of their appropriateness
  • Figure 2: Comparison of different prompting approaches in terms of their alignment with the speaker
  • Figure 3: LLM Approach to selecting gestures. Based on the type of approach, the input of the LLM can contain the context, annotation (examples), or gesture knowledge (gesture description)
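
The Figure 3 caption says the LLM input may contain context, annotated examples, or gesture descriptions, depending on the prompting approach. The sketch below illustrates one way such a prompt could be assembled from optional parts. It is not the authors' implementation: the function name `build_gesture_prompt`, its parameters, the section labels, and the sample gesture are all hypothetical, and the call to the model itself is left out.

```python
from typing import Mapping, Optional, Sequence, Tuple


def build_gesture_prompt(
    utterance: str,
    context: Optional[str] = None,
    examples: Optional[Sequence[Tuple[str, str]]] = None,
    gesture_knowledge: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a prompt from whichever optional inputs are supplied.

    All names and section labels here are illustrative assumptions,
    not taken from the paper.
    """
    parts = ["Select an appropriate co-speech gesture for the utterance below."]

    # Optional surrounding discourse context.
    if context:
        parts.append("Context:\n" + context)

    # Optional gesture knowledge: gesture name mapped to a short description.
    if gesture_knowledge:
        lines = [f"- {name}: {desc}" for name, desc in gesture_knowledge.items()]
        parts.append("Gesture descriptions:\n" + "\n".join(lines))

    # Optional annotated examples: (utterance text, gesture label) pairs.
    if examples:
        lines = [f'- "{text}" -> {gesture}' for text, gesture in examples]
        parts.append("Annotated examples:\n" + "\n".join(lines))

    # The utterance always comes last.
    parts.append("Utterance:\n" + utterance)
    return "\n\n".join(parts)


prompt = build_gesture_prompt(
    "We will not back down.",
    gesture_knowledge={"sweep": "flat hand moves outward"},  # made-up entry
)
print(prompt)
```

Leaving a parameter unset drops its section entirely, so the same function can produce each prompt variant by toggling which inputs are passed.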