ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

Yong Xie; Yunlian Sun; Hongwen Zhang; Yebin Liu; Jinhui Tang

TL;DR

ReCoM generates realistic, speech-synchronized human gestures with high fidelity and strong generalization. It introduces a Recurrent Embedded Transformer (RET), built on a Vision Transformer (ViT), that incorporates Dynamic Embedding Regularization (DER). At inference time, RET is paired with Iterative Reconstruction Inference (IRI), classifier-free guidance, and temporal smoothing to mitigate autoregressive error accumulation and enable zero-shot generalization. The approach uses dual VQ-VAE codebooks for hands and body, and a non-autoregressive decoding framework to produce coherent, natural gestures. Experiments on benchmark datasets show state-of-the-art performance, including a reduction in Fréchet Gesture Distance (FGD) from 18.70 to 2.48, and perceptual studies confirm higher user preference.

Abstract

We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
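The two components that drive IRI, together with the reported FGD improvement, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard classifier-free guidance blend of conditional and unconditional predictions, and it uses a centered moving average as a stand-in for the temporal smoothing step, whose exact form the abstract does not specify. All function names and the `scale` and `window` parameters are hypothetical.

```python
from typing import List


def classifier_free_guidance(
    cond_logits: List[float], uncond_logits: List[float], scale: float
) -> List[float]:
    """Standard CFG blend: uncond + scale * (cond - uncond).

    scale = 1.0 returns the conditional logits unchanged; larger values
    push predictions further toward the speech-conditioned distribution.
    (Assumed formulation; the paper may apply guidance differently.)
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def moving_average_smooth(
    frames: List[List[float]], window: int = 3
) -> List[List[float]]:
    """Smooth a pose sequence (frames x features) over time.

    Uses a centered moving average, truncated at the sequence ends.
    This is an illustrative stand-in for the paper's smoothing process.
    """
    half = window // 2
    smoothed = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        span = frames[lo:hi]
        # Average each feature column over the frames in the window.
        smoothed.append([sum(col) / len(span) for col in zip(*span)])
    return smoothed


def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FGD."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    print(classifier_free_guidance([2.0, 0.0], [1.0, 1.0], scale=2.0))
    print(moving_average_smooth([[0.0], [3.0], [0.0]], window=3))
    # FGD values taken from the abstract: 18.70 (prior) and 2.48 (ReCoM).
    print(round(100 * relative_improvement(18.70, 2.48), 1))
```

The last call reproduces the arithmetic behind the abstract's headline number: (18.70 - 2.48) / 18.70 ≈ 0.867, i.e. an 86.7% reduction in FGD.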
