A Unified Editing Method for Co-Speech Gesture Generation via Diffusion Inversion

Zeyu Zhao, Nan Gao, Zhi Zeng, Guixuan Zhang, Jie Liu, Shuwu Zhang

TL;DR

This work tackles the editing bottleneck in diffusion-based co-speech gesture generation by introducing diffusion inversion as a unified editing framework. It exploits intermediate noise reconstruction for high-level style-preserving edits and input-noise optimization for fine-grained, low-level adjustments, all without re-training the diffusion model. The approach is demonstrated across multiple editing tasks with subjective and objective validation, showing improved style preservation, editing accuracy, and human-likeness alongside acceptable latency. The method has practical implications for content creators and interactive avatars, enabling flexible, near real-time gesture editing on standard hardware.

Abstract

Diffusion models have shown great success in generating high-quality co-speech gestures for interactive humanoid robots or digital avatars from noisy input with the speech audio or text as conditions. However, they rarely focus on providing rich editing capabilities for content creators other than high-level specialized measures like style conditioning. To resolve this, we propose a unified framework utilizing diffusion inversion that enables multi-level editing capabilities for co-speech gesture generation without re-training. The method takes advantage of two key capabilities of invertible diffusion models. The first is that through inversion, we can reconstruct the intermediate noise from gestures and regenerate new gestures from the noise. This can be used to obtain gestures with high-level similarities to the original gestures for different speech conditions. The second is that this reconstruction reduces activation caching requirements during gradient calculation, making the direct optimization on input noises possible on current hardware with limited memory. With different loss functions designed for, e.g., joint rotation or velocity, we can control various low-level details by automatically tweaking the input noises through optimization. Extensive experiments on multiple use cases show that this framework succeeds in unifying high-level and low-level co-speech gesture editing.
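The first capability, inverting gestures back to noise and regenerating from that noise under a new speech condition, can be illustrated with a toy sketch. This section does not say which invertible sampler the paper uses, so the code below swaps in a coupled update in the style of EDICT, which is one known way to make a deterministic diffusion sampler exactly invertible. The schedule `alpha_bar`, the mixing weight `MIX`, the stand-in denoiser `toy_eps`, and the short float lists used as "gestures" and "speech conditions" are all hypothetical and chosen only for illustration.

```python
import math

NUM_STEPS = 10   # hypothetical number of sampling steps
MIX = 0.93       # hypothetical mixing weight in (0, 1) for the coupled update


def alpha_bar(t):
    # Hypothetical cumulative noise schedule: 1.0 at t=0 (clean), small at t=NUM_STEPS.
    return math.cos(0.45 * math.pi * t / NUM_STEPS) ** 2


def coeffs(t):
    # Deterministic (DDIM-style) coefficients for the step from t to t-1.
    a_t, a_prev = alpha_bar(t), alpha_bar(t - 1)
    a = math.sqrt(a_prev / a_t)
    b = math.sqrt(1.0 - a_prev) - math.sqrt(a_prev * (1.0 - a_t) / a_t)
    return a, b


def toy_eps(x, cond):
    # Stand-in for a trained, speech-conditioned noise predictor (assumption).
    return [math.tanh(xi - ci) for xi, ci in zip(x, cond)]


def denoise_step(x, y, t, cond):
    # Coupled update: each sequence is updated using the other, so it can be undone exactly.
    a, b = coeffs(t)
    x_i = [a * xi + b * e for xi, e in zip(x, toy_eps(y, cond))]
    y_i = [a * yi + b * e for yi, e in zip(y, toy_eps(x_i, cond))]
    x_new = [MIX * u + (1 - MIX) * v for u, v in zip(x_i, y_i)]
    y_new = [MIX * v + (1 - MIX) * u for v, u in zip(y_i, x_new)]
    return x_new, y_new


def invert_step(x, y, t, cond):
    # Algebraic inverse of denoise_step: undo the mixing, then undo the two updates.
    a, b = coeffs(t)
    y_i = [(v - (1 - MIX) * u) / MIX for v, u in zip(y, x)]
    x_i = [(u - (1 - MIX) * v) / MIX for u, v in zip(x, y_i)]
    y_t = [(v - b * e) / a for v, e in zip(y_i, toy_eps(x_i, cond))]
    x_t = [(u - b * e) / a for u, e in zip(x_i, toy_eps(y_t, cond))]
    return x_t, y_t


def invert(gesture, cond):
    # Map a clean gesture to its noise pair under the given condition.
    x, y = list(gesture), list(gesture)
    for t in range(1, NUM_STEPS + 1):
        x, y = invert_step(x, y, t, cond)
    return x, y


def generate(noise_x, noise_y, cond):
    # Run the coupled sampler from noise down to a clean gesture.
    x, y = list(noise_x), list(noise_y)
    for t in range(NUM_STEPS, 0, -1):
        x, y = denoise_step(x, y, t, cond)
    return x, y


gesture = [0.3, -0.5, 0.8, 0.1]        # toy "gesture" features
speech_a = [0.2, 0.2, 0.2, 0.2]        # toy condition for the original speech
speech_b = [-0.4, 0.6, -0.1, 0.5]      # toy condition for the new speech

noise = invert(gesture, speech_a)
reconstructed = generate(noise[0], noise[1], speech_a)[0]
regenerated = generate(noise[0], noise[1], speech_b)[0]
```

In this toy setting, `reconstructed` matches `gesture` up to floating-point error because every step is undone algebraically, while `regenerated` reuses the same noise under a different condition. The same exact invertibility is what lets intermediate states be recomputed rather than stored when gradients are taken with respect to the input noise.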

Paper Structure

This paper contains 19 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Examples of (a) high-level editing: copying the basic style for different speech, and (b) low-level editing: tweaking joint rotation in specified frames, symmetrizing the left and right parts of the body, etc.
  • Figure 2: Two key capabilities of invertible diffusion models: (a) intermediate noise reconstruction for high-level editing and (b) input noise optimization for low-level editing.
  • Figure 3: Demonstration of style-preserving regeneration. Compared to the baselines, which produce widely varying results for new speech with the same conditions (A), the proposed model gives results that are more similar in style to the original gestures (B). Input text is omitted. Zoom in for a closer look and more intermediate frames. The same applies to the figures on frame-joint editing, motion range editing, velocity editing, and symmetry editing.
  • Figure 4: Demonstration of frame-joint editing. The proposed method can complete the editing goal with modifications better merged into the original gestures (A), instead of a less human-like interpolation when edited manually (B). The baseline method does not perform as well due to low editing resolution (one key frame per second) and skeleton definition conversion (C). Input audio is omitted.
  • Figure 5: Demonstration of motion range editing. The proposed method regenerates more significant motion range changes than manual editing (A). However, more optimization steps do not necessarily lead to better results (B).
  • ...and 3 more figures