SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

Nan Gao; Yihua Bao; Dongdong Weng; Jiayi Zhao; Jia Li; Yan Zhou; Pengfei Wan; Di Zhang

TL;DR

This work tackles semantic co-speech gesture generation by introducing SARGes, which uses an LLM-based intent chain reasoning mechanism constrained by a co-speech gesture ethogram to produce reliable gesture labels. A subsequent lightweight gesture-label generator is trained to map text to semantically grounded gestures, enabling real-time synthesis. While GPT-4 achieves higher semantic depth, the proposed Qwen-based model offers substantially faster inference at lower cost, demonstrating a practical trade-off for real-time virtual agents. The approach provides an interpretable, constraint-driven pathway for semantic gesture synthesis with potential for broader deployment in virtual humans and social robots.

Abstract

Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that decomposes gesture semantics into structured inference steps following ethogram criteria, guiding LLMs to generate context-aware gesture labels. Next, we built an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model on it; this model then guides the generation of credible and semantically coherent co-speech gestures. Experimental results show that SARGes achieves semantically aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
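
The idea of constraining an intent chain by an ethogram can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the ethogram categories, the gesture labels, the step names (`intent`, `category`, `label`), and the fallback to `no_gesture` are hypothetical and are not taken from the paper, which does not specify them in this section. The sketch only shows how a chain of inference steps might be checked against a fixed label vocabulary before a label is accepted.

```python
from dataclasses import dataclass

# Hypothetical mini-ethogram: gesture category -> allowed gesture labels.
# The paper's actual ethogram is not given here; these entries are placeholders.
ETHOGRAM = {
    "deictic": {"point_forward", "point_self"},
    "emblematic": {"thumbs_up", "wave"},
    "none": {"no_gesture"},
}


@dataclass
class IntentStep:
    """One step of a (hypothetical) intent chain, e.g. name='category'."""
    name: str
    value: str


def validate_chain(steps):
    """Return the chain's gesture label if it is consistent with the ethogram.

    A chain is accepted only when its 'category' step names a known
    category and its 'label' step is allowed for that category.
    Otherwise the function falls back to 'no_gesture'.
    """
    fields = {step.name: step.value for step in steps}
    category = fields.get("category")
    label = fields.get("label")
    if category in ETHOGRAM and label in ETHOGRAM[category]:
        return label
    return "no_gesture"


# Example chain, as an LLM might emit it for the utterance "Hello everyone".
chain = [
    IntentStep("intent", "greeting"),
    IntentStep("category", "emblematic"),
    IntentStep("label", "wave"),
]
accepted_label = validate_chain(chain)
```

In this sketch, a label that falls outside the category chosen earlier in the chain is rejected rather than passed on, which is one simple way to read "reliable" labels under a fixed vocabulary.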
