Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion
Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov
TL;DR
The paper tackles real-time co-speech gesture synthesis by addressing the latency and coherence trade-offs of diffusion models. It introduces a universal Streaming Rolling Diffusion framework that turns diffusion-based gesture models into real-time streaming generators using a structured rolling window and per-frame time embeddings, and it further advances efficiency with Rolling Diffusion Ladder Acceleration (RDLA), which denoises multiple frames in parallel via a laddered noise schedule. Extensive experiments on ZEGGS and BEAT across multiple baselines show consistent improvements in realism and diversity, plus up to 4× faster inference with RDLA while maintaining temporal coherence. User studies corroborate the quantitative gains, validating the approach for realistic, synchronized co-speech gestures in interactive settings.
Abstract
Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4× speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework's ability to generate realistic, diverse gestures closely synchronized with the audio input.
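To make the two noise schedules concrete, the toy sketch below shows one possible reading of them. It is not the authors' implementation. The integer noise levels, the function names (`rolling_levels`, `ladder_levels`, `count_denoiser_calls`), and the window and ladder sizes are all assumptions made for illustration. The sketch assumes that each denoiser call lowers every frame in the window by one noise level, that frames reaching level 0 are emitted, and that the same number of fresh, fully noised frames is appended.

```python
def rolling_levels(window: int) -> list[int]:
    """Progressive schedule (assumed): frame i sits at noise level i + 1."""
    return [i + 1 for i in range(window)]


def ladder_levels(window: int, ladder_step: int) -> list[int]:
    """Ladder schedule (assumed): blocks of `ladder_step` frames share a level."""
    if ladder_step <= 0 or window % ladder_step != 0:
        raise ValueError("window must be a positive multiple of ladder_step")
    return [i // ladder_step + 1 for i in range(window)]


def count_denoiser_calls(levels: list[int], num_frames: int) -> int:
    """Count denoiser calls needed to emit at least `num_frames` clean frames.

    Purely a bookkeeping simulation: no network is evaluated.
    """
    max_level = levels[-1]
    window = list(levels)
    calls = 0
    emitted = 0
    while emitted < num_frames:
        # One denoiser call lowers every frame in the window by one level.
        window = [level - 1 for level in window]
        calls += 1
        # Frames that reached level 0 are clean and leave the window.
        done = sum(1 for level in window if level == 0)
        emitted += done
        # Slide the window: drop clean frames, append fully noised ones.
        window = window[done:] + [max_level] * done
    return calls


if __name__ == "__main__":
    print(rolling_levels(8))
    print(ladder_levels(8, 4))
    print(count_denoiser_calls(rolling_levels(8), 120))
    print(count_denoiser_calls(ladder_levels(8, 4), 120))
```

In this toy model a ladder step of 4 emits four frames per denoiser call, so it needs a quarter of the calls of the one-frame-per-call schedule. This only counts calls in the simulation; it is not a measurement, and it should not be read as the source of the 4× figure reported above.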
