SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

Nan Gao, Yihua Bao, Dongdong Weng, Jiayi Zhao, Jia Li, Yan Zhou, Pengfei Wan, Di Zhang

TL;DR

This work tackles semantic co-speech gesture generation by introducing SARGes, which uses an LLM-based intent chain reasoning mechanism constrained by a co-speech gesture ethogram to produce reliable gesture labels. A subsequent lightweight gesture-label generator is trained to map text to semantically grounded gestures, enabling real-time synthesis. While GPT-4 achieves higher semantic depth, the proposed Qwen-based model offers substantially faster inference at lower cost, demonstrating a practical trade-off for real-time virtual agents. The approach provides an interpretable, constraint-driven pathway for semantic gesture synthesis with potential for broader deployment in virtual humans and social robots.

Abstract

Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
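
As a rough illustration of the pipeline described above, the following minimal sketch shows one way an intent chain record could be checked against an ethogram before it becomes a text-to-label training pair. Everything in it is assumed for illustration: the `ETHOGRAM` entries, the IDs, the criteria strings, the `IntentChain` fields, and the function names are hypothetical and not taken from the paper. Only the 'Rub Hands' gesture name appears in the source.

```python
from dataclasses import dataclass

# Hypothetical ethogram: IDs and criteria are illustrative assumptions.
# Only the "Rub Hands" gesture name comes from the source (Figure 2).
ETHOGRAM = {
    "G01": {"name": "Rub Hands", "criteria": "anticipation or eagerness"},
    "G02": {"name": "Open Palms", "criteria": "offering or explaining"},
}
NO_GESTURE = "NONE"  # assumed fallback label when no ethogram entry applies


@dataclass
class IntentChain:
    """One structured reasoning record (field names are assumptions)."""
    text: str       # the spoken text being labeled
    intent: str     # step 1: what the speaker is trying to convey
    candidate: str  # step 2: ethogram ID proposed for that intent
    rationale: str  # step 3: why the ethogram criteria are considered met


def resolve_label(chain: IntentChain) -> str:
    """Accept the proposed ID only if it exists in the ethogram."""
    return chain.candidate if chain.candidate in ETHOGRAM else NO_GESTURE


def to_training_pair(chain: IntentChain) -> dict:
    """Turn a validated chain into a text-to-label pair for a small model."""
    return {"input": chain.text, "label": resolve_label(chain)}
```

In this sketch the ethogram acts as a closed vocabulary: a label proposed outside it is replaced by the fallback rather than passed on to training.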

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Pipeline for Gesture Label Generation.
  • Figure 2: Guidelines Illustration for the 'Rub Hands' Gesture
  • Figure 3: Objective Evaluation Results
  • Figure 4: Gesture Label Generation Visualization. It is important to note that we did not further distinguish different categories (such as A, B, C, and D) within the gesture IDs in order to avoid increasing the difficulty of model training.