Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov

TL;DR

The paper tackles real-time co-speech gesture synthesis by addressing the latency and coherence trade-offs of diffusion models. It introduces a universal Streaming Rolling Diffusion framework that turns diffusion-based gesture models into real-time streaming generators using a structured rolling window and per-frame time embeddings, and it further advances efficiency with Rolling Diffusion Ladder Acceleration (RDLA), which denoises multiple frames in parallel via a laddered noise schedule. Extensive experiments on ZEGGS and BEAT across multiple baselines show consistent improvements in realism and diversity, plus up to 4× faster inference with RDLA while maintaining temporal coherence. User studies corroborate the quantitative gains, validating the approach for realistic, synchronized co-speech gestures in interactive settings.

Abstract

Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4× speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework's ability to generate realistic, diverse gestures closely synchronized with the audio input.
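
The progressive and ladder-based noise schedules mentioned above can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: it assumes integer noise levels, a number of diffusion steps equal to the window length, and a window length divisible by the ladder step. The function names `window_noise_levels` and `roll_once` are hypothetical, and the denoising network itself is not modeled.

```python
from typing import List, Tuple


def window_noise_levels(window: int, ladder: int = 1) -> List[int]:
    """Per-frame noise level inside a rolling window (toy assumption).

    With ladder=1 the levels grow by one per frame (progressive schedule).
    With ladder>1, blocks of `ladder` consecutive frames share one level.
    """
    if ladder < 1 or window % ladder != 0:
        raise ValueError("window must be a positive multiple of ladder")
    return [((i // ladder) + 1) * ladder for i in range(window)]


def roll_once(levels: List[int], ladder: int = 1) -> Tuple[List[int], int]:
    """Simulate one denoising call followed by a window shift.

    Every frame's level drops by `ladder`; frames that reach level 0 are
    counted as emitted, and the same number of pure-noise frames (at the
    highest level) is appended at the end of the window.
    """
    top = max(levels)
    lowered = [t - ladder for t in levels]
    emitted = sum(1 for t in lowered if t == 0)
    return lowered[emitted:] + [top] * emitted, emitted


if __name__ == "__main__":
    for step in (1, 4):
        levels = window_noise_levels(8, ladder=step)
        new_levels, emitted = roll_once(levels, ladder=step)
        print(f"ladder={step}: levels={levels}, emitted per call={emitted}")
```

In this toy model the window returns to the same level pattern after every call, so generation can continue indefinitely. With `ladder=4`, one call emits four clean frames instead of one, which is consistent with the reported speedup of up to 4×, although the actual schedule in the paper may differ in its details.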

Paper Structure

This paper contains 32 sections, 9 equations, 6 figures, 13 tables, 2 algorithms.

Figures (6)

  • Figure 1: Visualization of the rolling denoising process with parameters $T = 5$, $N = 5$, $n^{cont} = 1$, $s=1$
  • Figure 2: Rolling ladder steps $k$ (blue bottom squares) and $k+1$ (red upper squares) for the ladder step size $l=4$ with corresponding noise level values and frames in the rolling window $\mathbf{x}^{t_0^4}_j$. The hollow squares of the corresponding color show the initial positions of the noise levels for a ladder of step size $l=1$.
  • Figure 3: Qualitative Comparison on ZEGGS Dataset. Columns (a–c) are DiffuseStyleGesture, Taming, and PersonaGestor; the baseline is on the bottom row and our version on the top. Our approach generates a broader range of natural, diverse gestures that are tightly synchronized with the speech signal. Specifically, it responds to the negation “don’t” with a rejecting/withdrawing motion, whereas the baselines largely repeat neutral poses and miss this semantic cue.
  • Figure 4: User study results. Left: "Ours" is the rolling modification of DSG and "Theirs" is the original DSG. In total, $48.4\%$ of participants preferred our model, while $36.3\%$ preferred the original DSG. Right: RDLA user study results. In total, $48.2\%$ of participants preferred the DSG rolling model, while $45.7\%$ preferred RDLA.
  • Figure 5: Qualitative Comparison on ZEGGS Dataset. Columns (a–c) show DiffuseStyleGesture, Taming, and PersonaGestor; the baseline is on the bottom row and our version on the top. Our method produces natural, dynamic gestures that are tightly aligned with the speech content. Notably, it uses an energetic hand movement to point toward the interlocutor on the word “you,” effectively emphasizing the rhetorical intent of the phrase “can you think for a second, that that’s a good idea,” while the baselines largely lack emphasis and gesture variation.
  • ...and 1 more figure