Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation

Huan Yang; Jiahui Chen; Chaofan Ding; Runhua Shi; Siyu Xiong; Qingqi Hong; Xiaoqi Mo; Xinhan Di

TL;DR

This work explores the representation of co-speech gestures, focusing on self-supervised representation and pixel-level motion deviation, using a diffusion model that incorporates latent motion features.

Abstract

Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations learned through data-driven approaches, we explore the representation of co-speech gestures, focusing on self-supervised representation and pixel-level motion deviation, using a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate the generation of hand gestures, which are crucial for realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% in FGD, DIV, and FVD, 8.1% in PSNR, and 2.5% in SSIM over the current state-of-the-art methods.
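For readers unfamiliar with how such figures are usually obtained, the following is a minimal sketch of a PSNR computation and of a relative-improvement percentage. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say how PSNR was implemented or how the percentages were derived, and the function names `psnr` and `relative_improvement`, the 8-bit pixel range, and the sign convention are all hypothetical.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences.

    Assumes flattened frames with values in [0, max_value]; the 8-bit range
    is an assumption, not something stated in the paper.
    """
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_improvement(baseline, ours, higher_is_better=True):
    """Percentage change of `ours` over `baseline`, signed so positive means better.

    For distance-style metrics such as FGD or FVD (lower is better), pass
    higher_is_better=False. This convention is an assumption.
    """
    change = (ours - baseline) / abs(baseline) * 100.0
    return change if higher_is_better else -change
```

As a usage note, `relative_improvement(baseline_score, our_score)` applies to metrics where higher is better (PSNR, SSIM, DIV), while `higher_is_better=False` covers FGD and FVD. Any scores passed in are placeholders supplied by the reader; none are taken from the paper.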

Paper Structure (17 sections, 11 equations, 5 figures, 3 tables)

Introduction
Method
Stage 1: Base Model Learning
Image Encoder.
Feature Enhancer.
Self-Supervised Deviation Module.
Latent Deviation Extractor.
Warping Calculator.
Latent Deviation Decoder.
Training.
Stage 2: Latent Motion Diffusion
Feature Priors and Loss.
Experiments and Results
Dataset and Evaluation Metrics.
Evaluation on Results (First Stage)
...and 2 more sections

Figures (5)

Figure 1: Examples of our generated gesture videos. White dashed arrows indicate gestures corresponding to bold words. The red dotted boxes indicate the mouth shapes corresponding to the italicized words.
Figure 2: The co-speech gesture video generation pipeline of our proposed method consists of three main components: 1) the latent deviation extractor (orange), 2) the latent deviation decoder (blue), and 3) the latent motion diffusion (green).
Figure 3: The deviation in latent representation.
Figure 4: Visual comparison with state-of-the-art methods. Our method generates gestures with more extensive and accurate motions (dashed boxes) that match the audio and semantics. Red boxes indicate unrealistic gestures generated by ANGIE [liu2022audio], MM-Diffusion [ruan2022mmdiffusion], and S2G-MDDiffusion [he2024co].
Figure 5: Visualization results of fine-grained hand variations. The gesture videos we generate are clearer, more reasonable, more diverse and more natural in the same frame.
