TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, Takafumi Taketomi

TL;DR

TANGO tackles the challenge of producing realistic, audio-synchronized co-speech gesture videos by combining a graph-based gesture retrieval framework with a diffusion-based interpolation module. It introduces AuMoCLIP, a hierarchical audio-motion embedding that enables cross-modal retrieval, and ACInterp, a diffusion-driven interpolator that preserves appearance and reduces artifacts in transitions. A graph-pruning step ensures long, connected playback paths, while diffusion interpolation fills in non-existent transitions with high fidelity. Evaluations on Show-Oliver and YouTube Business datasets show improvements over prior methods in video quality and audio-gesture alignment, with strong qualitative and quantitative results. The work also provides open-source tools for motion graphs and audio-driven video generation, enabling future extensions to broader human-motion domains.

Abstract

We present TANGO, a framework for generating co-speech body-gesture videos. Given a few-minute, single-speaker reference video and target speech audio, TANGO produces high-fidelity videos with synchronized body gestures. TANGO builds on Gesture Video Reenactment (GVR), which splits and retrieves video clips using a directed graph structure, representing video frames as nodes and valid transitions as edges. We address two key limitations of GVR: audio-motion misalignment and visual artifacts in GAN-generated transition frames. In particular, (i) we propose retrieving gestures using latent feature distance to improve cross-modal alignment. To ensure that the latent features can effectively model the relationship between speech audio and gesture motion, we implement a hierarchical joint embedding space (AuMoCLIP); (ii) we introduce a diffusion-based model to generate high-quality transition frames. Our diffusion model, Appearance Consistent Interpolation (ACInterp), is built upon AnimateAnyone and includes a reference motion module and homography background flow to preserve appearance consistency between generated and reference videos. By integrating these components into the graph-based retrieval framework, TANGO reliably produces realistic, audio-synchronized videos and outperforms all existing generative and retrieval methods. Our code and pretrained models are available at https://pantomatrix.github.io/TANGO/

Paper Structure

This paper contains 17 sections, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: TANGO is a framework designed to generate co-speech body-gesture videos using a motion graph-based retrieval approach. It first retrieves most of the reference video clips that match the target speech audio by utilizing an implicit hierarchical audio-motion embedding space. Then, it adopts a diffusion-based interpolation network to generate the remaining transition frames and smooth the discontinuities at clip boundaries.
  • Figure 2: Limitations of GVR [zhou2022audio].
  • Figure 3: System Pipeline of TANGO. TANGO generates gesture video in three steps. Firstly, it creates a directed motion graph to represent video frames as nodes and valid transitions as edges. Each sampled path (in bold) dictates the selected playback order. Secondly, an audio-conditioned gesture retrieval module aims to minimize cross-modal feature distance to find a path where gestures best match target audio. Lastly, a diffusion-based interpolation model generates appearance-consistent connection frames when the transition edges do not exist in the original reference video.
  • Figure 4: Graph Pruning. We delete paths with dead endpoints, i.e., those ending at a node with no outgoing edges in the initial Gesture Video Graph (left), by merging strongly connected component (SCC) subgraphs, and obtain a strongly connected subgraph (right). Each node in the pruned graph is reachable from any other node within this subgraph, enabling efficient sampling of long videos. The color of the paths represents different reference video clips for one speaker.
  • Figure 5: AuMoCLIP. AuMoCLIP is a pipeline for training a hierarchical joint embedding. The audio waveform and extracted 3D motions are encoded in a learned embedding space where paired audio and motion are closer together than non-paired samples. It employs a dual-tower encoder architecture; each encoder is split into low-level and high-level sub-encoders. It additionally uses pretrained Wav2Vec2 and BERT features. The embedding is trained with a frame-wise and a clip-wise contrastive loss for local and global cross-modal alignment, respectively. For the frame-wise loss, frames within a close temporal window ($i \pm t$) are treated as positives, while distant frames in ($i - kt$, $i - t$) and ($i + t$, $i + kt$) are treated as negatives.
  • ...and 3 more figures
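
The graph pruning described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes the graph is given as a list of directed `(source, target)` edges and simply keeps the largest strongly connected component, whereas the caption speaks of merging SCC subgraphs. The function names `largest_scc` and `prune_to_scc` are hypothetical.

```python
from collections import defaultdict


def largest_scc(edges):
    """Return the node set of the largest strongly connected component.

    Uses Kosaraju's two-pass algorithm with explicit stacks, so long
    frame chains do not hit Python's recursion limit.
    """
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: DFS on the original graph, recording nodes in finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned, best = set(), set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        if len(component) > len(best):
            best = component
    return best


def prune_to_scc(edges):
    """Drop every edge that touches a node outside the largest SCC."""
    keep = largest_scc(edges)
    return [(u, v) for u, v in edges if u in keep and v in keep]
```

For example, with edges `[(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]`, nodes 3 and 4 lie on a dead-end path (node 4 has no outgoing edge), so only the cycle through nodes 0, 1, and 2 survives pruning. Within the kept subgraph every node can reach every other node, which is the property the caption relies on for sampling long playback paths.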
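
The positive and negative windows of the frame-wise loss in the Figure 5 caption can be made concrete with a short labeling sketch. It only assigns labels and does not compute the contrastive loss itself. The function name `framewise_labels`, the label values, and the boundary handling are assumptions: an offset of exactly `t` is treated as positive, an offset of exactly `k * t` is ignored, and frames farther away are ignored as well.

```python
def framewise_labels(num_frames, anchor, t, k):
    """Label each frame relative to an anchor frame index.

    Returns a list with 1 for positives (offset <= t), -1 for negatives
    (t < offset < k * t), and 0 for frames outside both windows.
    """
    labels = []
    for j in range(num_frames):
        offset = abs(j - anchor)
        if offset <= t:
            labels.append(1)    # inside the close window i +/- t
        elif offset < k * t:
            labels.append(-1)   # inside (i - kt, i - t) or (i + t, i + kt)
        else:
            labels.append(0)    # too far away: not used by the loss
    return labels
```

With `num_frames=10`, `anchor=4`, `t=1`, and `k=3`, frames 3 to 5 are positives, frames 2 and 6 are negatives, and all remaining frames are ignored.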