Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

M. Hamza Mughal, Rishabh Dabral, Merel C. J. Scholman, Vera Demberg, Christian Theobalt

TL;DR

We address the challenge of generating semantically meaningful co-speech gestures, which traditional beat-focused models often fail to ground in linguistic meaning. We introduce RAG-Gesture, a diffusion-based gesture generator that uses Retrieval Augmented Generation to inject semantically relevant exemplars from a gesture database at inference time via Latent Initialization and Retrieval Guidance, without requiring training. The framework explicitly separates specification (what to gesture) from animation (how to gesture) and employs two retrieval strategies—LLM-based gesture type prediction and discourse-based retrieval—grounding gestures in linguistic structure. Evaluations on BEAT2 show state-of-the-art performance across multiple speakers, with extensive ablations demonstrating the value of local, semantically guided retrieval and controllable retrieval augmentation. The approach yields natural, semantically grounded gestures and can extend to task-specific gestural patterns such as referential or emotion-driven gestures, offering practical benefits for avatars and telepresence systems.

Abstract

Non-verbal communication often comprises semantically rich gestures that help convey the meaning of an utterance. Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. Our neuro-explicit gesture generation approach is designed to produce semantic gestures grounded in interpretable linguistic knowledge. We achieve this by using explicit domain knowledge to retrieve exemplar motions from a database of co-speech gestures. Once retrieved, these semantic exemplar gestures are injected into our diffusion-based gesture generation pipeline using DDIM inversion and retrieval guidance at inference time, without any need for training. Further, we propose a control paradigm for guidance that allows users to modulate the amount of influence each retrieval insertion has over the generated sequence. Our comparative evaluations demonstrate the validity of our approach against recent gesture generation approaches. The reader is urged to explore the results on our project page.
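
As a rough illustration of the inference-time injection described in the abstract, the sketch below shows deterministic DDIM inversion of a retrieved exemplar and its placement into an otherwise random initial latent. It is a minimal sketch under stated assumptions, not the authors' implementation: the noise schedule, the `predict_x0` callable (a stand-in for a denoiser that predicts the clean sample), the list-of-frames latent layout, and all function names are hypothetical. Retrieval guidance and its strength control are not covered here.

```python
import math
import random
from typing import Callable, List

Latent = List[List[float]]                      # frames x channels (assumed layout)
Predictor = Callable[[Latent, int], Latent]     # hypothetical clean-sample predictor


def make_alphas(num_steps: int = 20) -> List[float]:
    """Hypothetical cumulative-alpha schedule: near 1 (clean) down to near 0 (noise)."""
    return [0.999 - (0.999 - 0.01) * i / (num_steps - 1) for i in range(num_steps)]


def _ddim_move(z: Latent, x0_hat: Latent, a_from: float, a_to: float) -> Latent:
    """Deterministic DDIM update (eta = 0) from noise level a_from to a_to."""
    out = []
    for z_row, x_row in zip(z, x0_hat):
        row = []
        for z_v, x_v in zip(z_row, x_row):
            # Noise implied by the current latent and the predicted clean sample.
            eps = (z_v - math.sqrt(a_from) * x_v) / math.sqrt(1.0 - a_from)
            row.append(math.sqrt(a_to) * x_v + math.sqrt(1.0 - a_to) * eps)
        out.append(row)
    return out


def ddim_invert(z0: Latent, predict_x0: Predictor,
                alphas: List[float], t_stop: int) -> Latent:
    """Walk a clean latent forward to the noise level at index t_stop."""
    z = [row[:] for row in z0]
    for i in range(t_stop):
        z = _ddim_move(z, predict_x0(z, i), alphas[i], alphas[i + 1])
    return z


def ddim_sample(z_t: Latent, predict_x0: Predictor,
                alphas: List[float], t_start: int) -> Latent:
    """Walk a noisy latent at index t_start back down to index 0."""
    z = [row[:] for row in z_t]
    for i in range(t_start, 0, -1):
        z = _ddim_move(z, predict_x0(z, i), alphas[i], alphas[i - 1])
    return z


def latent_initialization(num_frames: int, channels: int,
                          inverted_exemplar: Latent, start: int,
                          rng: random.Random) -> Latent:
    """Gaussian noise everywhere, with the inverted exemplar pasted in at `start`.

    Assumes start + len(inverted_exemplar) <= num_frames.
    """
    z = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_frames)]
    for offset, row in enumerate(inverted_exemplar):
        z[start + offset] = row[:]
    return z
```

In this sketch, sampling from the initialized latent with the same predictor would tend to reproduce the exemplar inside the pasted window while the rest of the sequence is generated freely. How closely a real model follows the exemplar depends on the denoiser and on any additional guidance.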

Paper Structure

This paper contains 47 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Our RAG-Gesture approach produces semantically meaningful gestures by leveraging explicit knowledge to retrieve exemplar gestures from the sparse semantic data \cite{liu2022beat} and guiding the diffusion-based generation process through Retrieval Augmentation.
  • Figure 2: Overview. Our approach retrieves example gestures on semantically important words in speech and inserts those examples into the generated gesture by using them to guide the generation.
  • Figure 3: RAG-Gesture Framework. Our approach leverages a diffusion model which predicts clean sample $\mathbf{\hat{z}}^{(0)}$ from noisy gesture sample $\mathbf{z}^{(t)}$. We then utilize retrieval algorithms (\ref{subsec:retrieval-algos}) to modify the gesture sampling at inference time by inserting the retrieved motion through Latent Initialization (\ref{subsec:inversion}) and further controlling the sampling process through Retrieval Guidance (\ref{subsec:guidance}). This results in a sampled motion which follows the semantic retrieval.
  • Figure 4: Retrieval Algorithms. Each algorithm parses the relevant semantic information (gesture types from LLM or discourse relations) and extracts gestures from a database by filtering examples using that information. Moreover, it also considers textual and prosodic context.
  • Figure 5: Results of Perceptual Evaluation. A2P: Audio2Photoreal \cite{ng2024audio2photoreal}; GT: Ground Truth. * denotes p-value $< 0.05$.
  • ...and 3 more figures
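
The retrieval step summarized in the Figure 4 caption (filter database examples by semantic information, then take textual and prosodic context into account) could be sketched roughly as follows. This is a minimal sketch under assumptions, not the paper's algorithm: the `GestureClip` fields, the gesture-type labels, the cosine-based scoring, and the `prosody_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GestureClip:
    clip_id: str
    gesture_type: str                 # hypothetical semantic label, e.g. "deictic"
    text_embedding: Sequence[float]   # hypothetical embedding of the spoken text
    prosody: Sequence[float]          # hypothetical prosodic feature vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(database: List[GestureClip], wanted_type: str,
             query_text: Sequence[float], query_prosody: Sequence[float],
             top_k: int = 1, prosody_weight: float = 0.5) -> List[GestureClip]:
    """Filter by semantic label, then rank by textual and prosodic similarity."""
    candidates = [c for c in database if c.gesture_type == wanted_type]

    def score(clip: GestureClip) -> float:
        return (cosine(clip.text_embedding, query_text)
                + prosody_weight * cosine(clip.prosody, query_prosody))

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Filtering before ranking keeps the semantic label as a hard constraint, so context similarity only decides among clips of the requested type. Whether the paper combines the signals this way is not stated in the caption.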