ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer

Yong Xie, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang

TL;DR

ReCoM generates realistic, speech-synchronized human gestures that generalize to unseen speech. It introduces a Recurrent Embedded Transformer (RET), built on a Vision Transformer (ViT), that incorporates Dynamic Embedding Regularization (DER). At inference, Iterative Reconstruction Inference (IRI), classifier-free guidance, and temporal smoothing mitigate the error accumulation of autoregressive inference and support zero-shot generalization. The approach uses dual VQ-VAE codebooks for hands and body and a non-autoregressive decoding framework to produce coherent, natural gestures. On benchmark datasets it reports state-of-the-art performance, including a reduction in Fréchet Gesture Distance (FGD) from 18.70 to 2.48, with perceptual studies confirming higher user preference.

Abstract

We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture to explicitly model co-speech motion dynamics. This architecture enables joint spatial-temporal dependency modeling, thereby enhancing gesture naturalness and fidelity through coherent motion synthesis. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization, thereby improving the naturalness and fluency of zero-shot motion generation for unseen speech inputs. To mitigate inherent limitations of autoregressive inference, including error accumulation and limited self-correction, we propose an iterative reconstruction inference (IRI) strategy. IRI refines motion sequences via cyclic pose reconstruction, driven by two key components: (1) classifier-free guidance improves distribution alignment between generated and real gestures without auxiliary supervision, and (2) a temporal smoothing process eliminates abrupt inter-frame transitions while ensuring kinematic continuity. Extensive experiments on benchmark datasets validate ReCoM's effectiveness, achieving state-of-the-art performance across metrics. Notably, it reduces the Fréchet Gesture Distance (FGD) from 18.70 to 2.48, demonstrating an 86.7% improvement in motion realism. Our project page is https://yong-xie-xy.github.io/ReCoM/.
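As a rough illustration of the two IRI components named in the abstract, the Python sketch below shows the standard classifier-free guidance combination and a simple moving-average smoother over frames. The function names `cfg_combine` and `smooth_frames`, the guidance scale, and the choice of a moving-average filter are assumptions made for illustration; the abstract does not specify the guidance weighting or the smoothing filter that ReCoM actually uses.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    `cond` are scores predicted with the speech condition, `uncond` without it.
    (Hypothetical helper; not taken from the ReCoM code base.)
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def smooth_frames(frames: Sequence[Sequence[float]], window: int = 3) -> List[List[float]]:
    """Centered moving average over time; the window shrinks at sequence edges.

    This filter is an assumed stand-in for the paper's temporal smoothing step.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        neighbours = frames[lo:hi]
        dims = len(frames[t])
        smoothed.append([sum(f[d] for f in neighbours) / len(neighbours) for d in range(dims)])
    return smoothed
```

With `scale = 0` the guided scores equal the unconditional ones and with `scale = 1` they equal the conditional ones; larger values push the prediction further toward the speech-conditioned output.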

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The gesture training and inference pipeline of our work. Given an audio input, our method aims to produce a high-fidelity gesture. We use the loss function $L_{VQ}$ to optimize the compositional VQ-VAEs, enabling them to learn discrete gesture representations. We employ effective data processing strategies to optimize the gesture generator, enabling it to obtain high-fidelity results. In the inference phase, $S_{i}$ denotes the $i$-th speech segment, and we use IRI together with a temporal smoothing process. The RET model is our gesture generator. Additionally, $n$ denotes the number of iterations, which varies with the input. The novel inference strategy further enhances the model's performance in a non-autoregressive way. Most of the variables mentioned in the paper are introduced in the Pipeline Overview section, while the remaining variables are introduced in the Face Generator, Gesture Codebook, and Gesture Generator sections.
  • Figure 2: Our ReCoM pipeline consists of three components: a face generator, a gesture generator and a compositional VQ-VAE. By inputting speech audio, we can obtain the corresponding gesture sequence. Among them, the face model is responsible for generating the facial movement sequence $\hat{\theta}_{face}^{1:T}$. The gesture generator aims to generate gesture indices with high confidence. These indices are then input to the compositional VQ-VAE for nearest neighbor search and decoding to obtain the gesture sequence $\hat{M}_{1:T}$.
  • Figure 3: For the face generator, we choose an encoder-decoder architecture.
  • Figure 4: Fusion module. We apply hybrid convolution (with a downsampling rate of 2 and a convolution kernel of $1 \times 1$) to fuse audio features and gesture features. The role of hybrid convolution here is to combine the audio and gesture features, enabling them to interact and form a unified feature representation. Then, we use intrinsic convolution to obtain the essential mixed features. Intrinsic convolution serves to downsample the mixed features into the latent space. This downsampling operation is crucial as it allows the ViT module to process the data without the need for an overly large number of parameters. Finally, we input the features into the ViT model by adopting a channel-wise strategy.
  • Figure 5: When receiving out-of-domain audio inputs, TalkSHOW produces frozen motion for several seconds. The results of ProbTalk often show incoherence between consecutive frames. Meanwhile, Habibie et al.'s method often generates gestures with overly large jitter amplitudes. Our ReCoM results instead remain natural. Please refer to the supplementary materials for more video demos, with both in-domain and out-of-domain evaluation.
  • ...and 1 more figures
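
The nearest neighbor search mentioned in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration of vector-quantization lookup with separate body and hand codebooks, assuming squared Euclidean distance and toy codebooks; the names `nearest_code`, `quantize_pair`, and `lookup` are hypothetical and do not come from the ReCoM code base.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `latent` (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(latent, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize_pair(
    body_latent: Vector,
    hand_latent: Vector,
    body_codebook: Sequence[Vector],
    hand_codebook: Sequence[Vector],
) -> Tuple[int, int]:
    """Quantize body and hand latents independently, one codebook each."""
    return (
        nearest_code(body_latent, body_codebook),
        nearest_code(hand_latent, hand_codebook),
    )


def lookup(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Map predicted code indices back to codebook vectors before decoding."""
    return [list(codebook[i]) for i in indices]


# Toy codebooks, assumed purely for illustration.
body_codebook = [[0.0, 0.0], [1.0, 1.0]]
hand_codebook = [[0.0], [2.0], [4.0]]
body_idx, hand_idx = quantize_pair([0.9, 0.8], [2.9], body_codebook, hand_codebook)
```

In this framing the gesture generator predicts index sequences, and `lookup` recovers the corresponding code vectors that a decoder would turn into poses. The decoder itself is omitted here.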